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ABSTRACT 


This  work  describes  a  scaleable,  high-performance,  pipelined,  vector  processor 
architecture.  Special  emphasis  is  placed  on  performing  fast  Fourier  transforms  with 
mixed-radix  butterfly  operations.  The  initial  motivation  for  the  architecture  was  the 
computation  of  cyclostationary  algorithms.  However,  the  resulting  architecture  is  capable 
of  general-purpose  vector  processing  as  well.  A  major  factor  affecting  the  performance 
of  the  architecture  is  the  memory  system  design.  The  use  of  pipelining  techniques, 
coupled  with  vector  processing,  places  a  substantial  burden  on  the  memory  system 
performance.  The  memory  design  is  based  on  an  interleaved  memory  philosophy  with  a 
buffering  technique  referred  to  as  split  transaction  memory  (STM).  A  crucial  aspect  of 
the  memory  design  is  the  memory  decoding  scheme.  A  design  methodology  is  described 
for  the  specification  of  permutation  matrices  that  yield  near-optimal  performance  for  the 
memory  system.  Another  important  aspect  of  this  work  is  the  development  of  a  software 
based  simulator  that  allows  a  STM  to  be  specified.  The  simulator,  operating  at  the 
register  transfer  level,  emulates  the  processing  of  an  address  stream  by  STM  and  records 
the  events  for  post-processing.  The  STM  simulator  was  used  to  evaluate  three  types  of 
vector  processing  address  patterns:  constant  stride,  constant  geometry  radix-r  butterfly, 
and  digit  reversed.  A  random  address  pattern  was  also  analyzed  in  the  context  of  general- 
purpose  computing.  STM  simulation  verified  the  near-optimal  performance  of  the  STM. 
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I.  INTRODUCTION 


A.  BACKGROUND 


This  research  began  with  an  investigation  of  computer  architectures  for  computing 
digital  implementations  of  the  Spectral  Correlation  Function  (SCF),  the  central  function  of 
spectral  correlation  or  cyclostationary  analysis.  The  SCF  is  defined  as: 
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and  T  is  the  length  of  a  time  window  for  Equation  (1.2).  The  variable  /  is  called  the 
spectral  location  parameter  and  corresponds  to  the  frequency  parameter  of  a  Fourier 
transform  pair.  It  is  expressed  as 
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a  is  the  spectral  separation  parameter  representing  a  frequency  of  second-order 
periodicity,  a ,  also  referred  to  as  the  cycle  frequency,  is  expressed  as 
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Xj{t,f)  is  the  Fourier  transform  of  the  time  series  signal  x(t}  of  length  T  centered 
at  time  t. 


The  SCF  of  most  man-made  signals  result  in  non-zero  cycle  frequencies.  An 
example  of  a  magnitude  plot  of  a  SCF  for  a  binary  phase  shift  keyed  (BPSK)  signal  is 
shown  in  Figure  I.l.  Each  non-zero  line  (called  di  feature)  in  the  plot  corresponds  to  a 
unique  value  of  a .  The  traditional  power  spectral  density  is  a  special  case  of  spectral 
correlation  analysis  (i.e.,  the  line  for  a  =  0).  The  power  spectral  density  is  the  feature  at 
the  back  of  the  plot.  Three  smaller  cycle  features  as  well  as  a  large  cycle  feature  can  be 
seen  in  the  plot.  Unlike  the  power  spectral  density,  noise  present  in  other  cycle 
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frequencies  will  not  correlate,  and  with  sufficient  averaging,  will  yield  a  feature 
regardless  of  the  noise  level.  This  provides  a  means  for  detecting  a  weak  signal  hidden  in 
noise. 


Several  digital  algorithms  have  been  developed  for  estimating  the  SCF  including 
the  Frequency  Smoothing  Method  (FSM)  and  the  time  smoothed  variants  FFT 
Accumulation  Method  (FAM)  and  the  Strip  Spectral  Correlation  Algorithm  (SSCA)  [Ref 
2]  [Ref  3].  Each  of  these  algorithms  are  heavily  based  on  vector  processing  in  general 
and  FFT  techniques  in  particular.  The  computational  complexity  of  these  algorithms  has 
been  extensively  analyzed.  Applications  for  cyclostationary  analysis  can  be  found  in 
Gardner  [Ref  1]  and  Gardner  [Ref  4].  The  computational  complexity  for  the  SSCA  will 
be  discussed  further  in  Chapter  0,  Section  F.  A  computer  designed  to  exploit  spectral 
correlation  features  is  referred  to  as  a  spectral  correlation  analyzer  (SCA). 


A  variety  of  architectures  were  investigated  for  computing  the  SCF  including 
networks  of  general-purpose  computers,  digital  signal  processing  (DSP)  architectures, 
and  more  specialized  architectures  based  on  vector  processing  techniques. 

For  example,  the  SSCA  was  implemented  on  a  network  of  Sun  workstations 
connected  with  an  Ethernet.  The  software  used  to  facilitate  communications  and  control 
of  the  distributed  application  was  Parallel  Virtual  Machine  (PVM).  It  was  found  that  the 
SSCA  could  be  partitioned  in  such  a  way  to  permit  effective  parallel  execution  on 
numerous  workstations.  This  provides  a  means  for  computing  a  computationally 
intensive  instance  of  the  SSCA  during  off  peak  hours  of  the  computing  facilities. 

The  primary  focus  however  was  to  find  computer  architectures  that  would 
compute  the  SCF  in  real  or  near-real  time.  Transputers  were  examined  to  determine 
feasibility  of  real  time  and  near-real  time  computation  for  the  fast  Fourier  transforms 
(FFTs)  in  particular  and  spectral  correlation  algorithms  in  general.  The  Transputer  is  a 
general-purpose  processor  that  contains  support  for  quick  context  switching  and 
communications  on  chip.  It  is  designed  to  be  scaleable  and  is  a  valid  technology  for 
many  application  domains.  However,  the  number  of  Transputers  that  would  be  required 
to  provide  the  needed  computation  was  found  to  be  too  many  for  a  reasonable 
implementation  for  this  application. 

Highly  specialized  architectures  have  also  been  considered  for  several 
cyclostationary  algorithms.  Architectures  for  both  frequency  and  time  smoothing 
algorithms  may  be  found  in  Roberts  [Ref  5]  and  [Ref  6].  These  architectures  are  based  on 
mapping  hardware  onto  the  algorithmic  requirements  thereby  providing  architectures  that 
can  yield  optimal  performance.  A  practical  disadvantage  to  this  approach  is  the  reduced 
cost  effectiveness  of  a  hardware  implementation  that  is  dedicated  to  a  particular 
algorithm. 

B.  VECTOR  ARCHITECTURE 

Another  architecture  reviewed  was  based  on  vector  array  processors.  This 
approach,  the  subject  of  this  dissertation,  is  based  on  streaming  data  through  a  highly 
pipelined  vector  processor  with,  in  the  ideal  case,  no  wait  states.  The  basic  concept  can 
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be  used  to  build  highly  optimized  architectures  for  many  but  not  all  of  the  functions 
needed  in  a  SCA  (i.e.,  those  portions  that  can  be  vectorized).  Alternatively,  this  basic 
approach  can  be  used  to  build  a  more  generalized  vector  processor  that  might  be  used  for 
any  problem  that  lends  itself  to  vector  processing.  This  more  generalized  approach  will 
be  referred  to  hereafter  as  the  butterfly  machine  architecture.  The  butterfly  machine 
(BFM)  architecture  can  also  be  scaled.  An  architecture  designed  with  multiple  vector 
processors  will  also  be  described  and  is  referred  to  as  the  parallel  butterfly  machine 
architecture.  This  name  is  not  to  be  confused  with  the  BBN  Butterfly  by  BBN  Advanced 
Computers  [Ref  7]. 

Given  the  technology  available  today,  the  key  issue  to  consider  when  evaluating 
the  butterfly  machine  architecture  is  the  requirements  of  the  memory  system.  The 
streaming  of  data  through  the  pipelined  vector  processor  requires  a  data  reference  from 
each  vector  each  clock  cycle.  A  typical  vector  operation  requires  two  input  vectors  and 
creates  one  output  vector,  therefore  implying  three  data  references  per  clock  cycle  per 
processor.  Given  that  the  vector  processor  is  pipelined,  the  clock  rate  applied  to  the 
processor  will  be  on  the  higher  end  of  the  scale  available  with  current  technology. 
Multiple  memory  references  per  clock  cycle  and  a  high  clock  rate  suggest  that  designing  a 
memory  to  accommodate  this  requirement  is  a  primary  area  of  concern. 

As  will  be  seen  in  Chapter  0,  the  butterfly  machine  architecture  calls  for  several 
large  memories  for  each  vector  processor.  Given  the  data  rate  requirements  stated  above, 
such  a  memory  system  could  be  accommodated  by  using  fast  static  random  access 
memory  (SRAM).  This  is  not  a  desirable  solution  because  SRAMs  are  much  more 
expensive  per  bit  relative  to  the  dynamic  random  access  memory  (DRAM)  alternative. 
Secondary  factors  favoring  a  bulk  storage  approach  such  as  DRAM  include  their  need  for 
less  power  and  circuit  board  real-estate.  The  issue  of  cost  becomes  more  acute  when 
considering  the  parallel  butterfly  machine  architecture.  Therefore,  a  cost  effective 
implementation  of  the  butterfly  machine  architecture  will  use  bulk  storage  technology 
such  as  DRAM  instead  of  SRAM  given  the  current  technology  base. 
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DRAM  has  been  the  memory  technology  for  implementing  main  memory  in 
general-purpose  computers  for  some  time.  Almost  any  general-purpose  computer 
acquired  today  will  have  a  memory  system  that  is  composed  of  a  main  memory  consisting 
of  DRAM  technology  coupled  with  one  or  two  levels  of  SRAM-based  cache  memory. 
However,  vector  array  machines  frequently  rely  on  some  form  of  banked  interleaved 
memory  (i.e.,  a  memory  system  consisting  of  parallel  memories  that  attempts  to  exploit 
the  parallelism  to  increase  throughput).  The  relative  merits  of  cache  versus  interleaved 
memory  techniques  for  a  memory  system  will  be  discussed  in  detail  in  Chapter  II. 

C.  PROBLEM  STATEMENT 

This  dissertation  describes  a  computer  architecture  that  is  optimized  for  vector 
processing  in  general  and  cyclostationary  processing  in  particular.  The  memory  system 
design  is  the  key  component  of  this  architecture  for  the  reasons  discussed  in  Section  B 
above. 

There  are  two  characteristics  of  the  butterfly  machine  environment  that 
distinguish  it  from  a  general-purpose  computing  environment  and  have  a  substantial 
effect  on  the  solution  to  the  memory  system.  First,  the  memory  references  are  very  dense 
when  compared  to  the  general-purpose  computing  case.  As  indicated  in  the  discussion 
above,  a  data  reference  is  required  for  each  vector  on  each  cycle.  This  imposes  a 
requirement  of  the  memory  system  that  is  more  stringent  than  would  be  expected  for  a 
general-purpose  computer. 

The  second  characteristic  of  the  butterfly  machine  environment  is  that  all  memory 
addresses  are  known  before  the  first  elements  of  the  vector  are  processed.  Therefore,  a 
memory  reference  stream  can  be  generated  with  certainty  for  instructions  and  data  to  be 
executed  in  the  future.  This  implies  that  substantial  latency  can  be  tolerated  given  that 
the  vector  length  is  long  relative  to  the  latency.  Note  that  this  is  in  sharp  contrast  to  the 
general-purpose  computing  architecture  where  very  little  latency  can  be  tolerated  without 
having  a  substantial  impact  on  performance.  It  will  be  shown  how  this  latency  is  traded 
for  memory  bandwidth  using  interleaved  memory. 
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Two  aspects  of  the  butterfly  machine  architecture  diminish  the  usefulness  of 
traditional  caching  techniques.  First,  since  memory  must  operate  at  the  same  speed  as  the 
processor,  there  are  no  “processing  only”  cycles  that  can  be  used  for  loading  the  cache  in 
parallel.  This  becomes  increasingly  more  important  when  the  size  of  the  cache  lines  are 
large  with  respect  to  the  bus  size.  Additionally,  address  reference  patterns  associated 
with  vector  processing  often  do  not  meet  the  locality  of  reference  criteria  needed  for  a 
memory  system  using  a  cache. 

The  primary  objective  of  this  dissertation  is  to  define  a  low-cost  attached  vector 
processor  architecture  that  is  well  suited  for  cyclostationary  analysis.  In  particular,  this 
architecture  will  perform  fast  Fourier  transforms  (FFTs)  and  other  vector  operations  to 
include  complex  addition  and  multiplication.  By  low  cost,  it  is  meant  that  the  vector 
processor  architecture  is  compatible  with  workstations  rather  than  mainframes  or 
supercomputers.  A  key  component  of  this  architecture  is  a  memory  system  that 
incorporates  low-cost  bulk  storage  memory.  Although  DRAMs,  the  current  choice  given 
today’s  technology  continue  to  increase  in  capability,  their  access  speeds  are  slower  than 
microprocessors  by  as  much  as  a  factor  of  ten  or  more.  This  research  addresses  the 
design  of  a  memory  system,  based  primarily  on  relatively  slow  bulk  storage  devices,  that 
will  provide  memory  bandwidth  that  is  sufficient  to  maintain  optimum  processor 
performance.  The  technique  used  to  construct  such  a  memory  is  referred  to  as  Split 
Transaction  Memory  (STM). 

STM  will  also  be  analyzed  in  the  context  of  general-purpose  computing.  This 
investigation  into  general-purpose  computing  provides  a  more  comprehensive 
understanding  of  the  use  of  STM  for  other  computing  environments. 

Architectures  for  computing  FFTs  have  been  studied  since  the  late  sixties.  One  of 
the  earliest  works  is  by  Pease  [Ref  8].  A  hardwired  signal  processor  for  radar 
applications  is  described  by  Groginsky  [Ref  9].  Another  developed  by  Lincoln  Labs 
Massachusetts  Institute  of  Technology,  is  found  in  Filip  [Ref  10].  This  processor  is 
designed  using  multiple  microprocessors  that  communicate  via  a  bus.  Two  methods  for 
resolving  the  bit-reversal  problem  are  discussed  by  Dieffenderfer  [Ref  11].  Another 
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hardwired  processor,  using  a  radix-4  butterfly,  is  presented  by  Corinthios  [Ref  12].  This 
processor  supports  real-time  applications  transforming  256-point  vectors  with  signal 
sampling  rates  up  to  1.6  million  samples  a  second.  A  VLSI  architecture  is  proposed  by 
Sapiecha  [Ref  13].  This  architecture  consists  of  two  and  three  dimensional  arrays  of 
processor  elements.  Two  real-time  processors  include  a  Winograd  Fourier  transform 
processor  presented  by  Sommer  [Ref  14]  and  a  processor  designed  for  synthetic-aperture- 
radar  applications  Franceschetti  [Ref  15]. 

The  work  contained  in  this  document  is  distinguished  from  the  work  noted  above 
in  that  the  processor  is  designed  for  general  vector  processing  as  well  as  FFT 
computation.  The  architecture  presented  in  this  dissertation  is  particularly  well  suited  for 
input  vectors  of  length  2^°  and  larger.  The  application  is  scaleable  providing  a  real-time 
or  near-real-time  response.  Major  emphasis  is  on  a  low-cost  memory  design. 

D.  ORGANIZATION  OF  DISSERTATION 

The  following  conventions  will  be  used  to  more  clearly  identify  features  of  the 
document.  Regular  text  is  in  Times  New  Roman  font.  Computer  program  names, 
algorithms,  and  variables  are  printed  with  Arial  font.  Other  variables  discussed  in  a 
different  context  than  a  program  are  shown  as  italic  Times  New  Roman. 

The  remainder  of  this  dissertation  is  organized  as  follows.  Chapter  n.  Historical 
Perspective  and  Related  Research,  provides  a  brief  description  and  comparison  of  several 
computer  architectures  and  a  comparison  of  cache  and  interleaving  memory  schemes.  A 
history  of  related  research  in  interleaved  memory  is  then  presented. 

The  next  chapter.  Butterfly  Machine  Architecture,  describes  the  butterfly  machine 
architecture  and  provides  a  context  for  use  of  STM. 

Chapter  IV,  Description  of  Split  Transaction  Memory  (STM),  presents  STM  first 
at  a  conceptual  level,  followed  by  a  hardware  design.  A  description  of  the  STM 
Simulator  is  then  presented. 

A  theoretical  model  of  STM  performance  parameters  is  described  in  Chapter  V, 
first  using  conventional  bank  number  decoding,  followed  by  permutation-based  decoding. 
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Chapter  VI,  Simulation  Studies,  describes  the  experiments.  The  theoretical 
performance  of  the  STM,  based  on  the  results  of  Chapter  V,  is  detailed  for  each 
experiment.  The  results  of  the  simulation  runs  of  each  experiment  is  described  and 
compared  to  the  theoretical  performance.  Conclusions  that  are  specific  to  an  experiment 
are  also  stated. 

Chapter  VII  lists  top  level  conclusions  and  describes  further  research. 

The  following  section  describes  the  original  contributions  of  this  work. 

E.  ORIGINAL  CONTRIBUTION 

The  primary  contribution  of  this  research  is  an  attached  vector  processor 
architecture  designed  for  executing  algorithms  that  require  an  efficient  implementation  of 
vector  processing  in  general,  and  the  fast  Fourier  transform  (FFT)  in  particular.  Classes 
of  problems  addressed  by  this  type  of  architecture  include  signal  processing,  spectral 
analysis,  digital  filtering,  and  cyclostationary  algorithms.  Cyclostationary  processing  is 
particularly  appropriate  for  this  architecture  because  of  its  computational  complexity. 

This  architecture,  referred  to  as  the  butterfly  machine  architecture,  provides  a  scaleable 
solution  compatible  with  workstation  environments.  A  key  component  of  this 
architecture  is  the  memory  design  referred  to  as  Split  Transaction  Memory  (STM).  STM 
exploits  the  specific  memory  reference  stream  characteristics  associated  with 
cyclostationary  processing  and  provides  a  throughput  to  the  vector  processor  that 
approaches  1.0  for  anticipated  address  patterns.  There  are  two  aspects  of  STM  that  are  of 
special  note. 

•  STM  is  an  interleaved  memory  that  buffers  memory  references.  The 
primitive  organizational  element  for  buffering  within  a  bank  is  referred  to  as  a 
cache  element.  The  use  of  cache  elements  within  banks  provides  a  more 
efficient  organization  than  standard  buffers  when  three  or  more  cache 
elements  are  called  for  in  each  bank. 

•  STM  uses  a  memory  decoding  scheme  that  is  optimized  for  memory 
reference  patterns  that  are  characterized  by  powers  of  two.  This  is 
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accomplished  by  using  permutation  matrices  to  decode  bank  numbers.  A 
design  methodology  is  developed  for  constructing  permutation  matrices  that 
are  designed  for  address  patterns  with  any  constant  stride  of  powers  of  two 
that  yield  near  ideal  performance  for  interleaved  memory  systems.  A  second 
methodology  is  presented  that  results  in  permutation  matrices  that  yield  near 
ideal  performance  for  constant  geometry  radix-r  butterfly  address  patterns. 
The  radix-r  butterfly  permutation  matrices,  modified  to  support  constant 
stride  of  powers  of  two  address  patterns,  provide  near  ideal  performance  for 
constant  stride  and  radix-r  butterfly  address  patterns.  The  third  address 
pattern  required  for  FFT-based  vector  processing,  digit  reversal,  also  yields 
near  ideal  performance  when  the  radix  of  the  butterfly  is  equal  to  or  greater 
than  the  number  of  banks.  When  this  condition  is  not  met,  the  actual 
performance  varies  from  near  ideal  to  fair.  Theory  is  developed  for  steady 
state  throughput  and  maximum  latency  for  each  of  the  address  patterns. 

Another  unique  contribution  of  this  research  is  an  event  driven  software  simulator 
that  provides  for  analysis  of  STM  memory  systems.  The  STM  simulator  accepts  a 
description  of  the  STM  memory  and  a  memory  reference  stream.  When  this  memory 
reference  stream  is  processed,  details  of  each  cycle  are  stored  at  the  register  level  for  later 
analysis.  Post  analysis  routines  provide  plots  and  tables  for  analysis  of  the  simulation 
run.  Programs  have  also  been  developed  to  generate  input  address  streams  for  constant 
stride,  radix-r  butterfly,  digit  reversed,  and  random  address  streams. 
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II.  HISTORICAL  PERSPECTIVE  AND  RELATED 


RESEARCH 


A.  THE  GENERAL  PROBLEM 

A  well  designed  computer  system  is  one  that  exhibits  a  balance  of  processing 
capability  and  communication  bandwidth  among  the  various  components,  delivered  at  a 
favorable  cost-performance  ratio.  This  balance  is  established  in  the  context  of  an  existing 
technology  base.  Since  the  advent  of  the  microprocessor,  processor  design  has  been  at 
the  forefront  of  computer  architecture.  Further,  the  combination  of  advances  in  clock 
rates  made  available  with  improvements  in  the  electronics,  and  architectural  advances 
such  as  the  issuing  of  multiple  instructions  per  clock  cycle,  has  resulted  in  an  increasing 
gap  between  processor  computational  capability  and  the  ability  for  memory  systems  to 
provide  data  at  sufficient  bandwidth  to  support  these  computations  for  general-purpose 
processors  Comerford  [Ref  16].  This  chapter  will  summarize  techniques  that  have  been 
explored  to  enhance  the  memory  system  of  computers. 

Before  proceeding  further,  it  should  be  noted  that  the  scope  of  memory  design 
techniques  has  been  strongly  influenced  by  the  type  of  computer  architecture  under 
consideration.  The  advent  of  a  variety  of  multiprocessor  architectures  has  provided  both 
new  challenges  as  well  as  opportunities.  Classes  of  computer  architectures  that  will  be 
discussed  below  in  the  context  of  the  processor-memory  imbalance  are  the  multiple 
instruction  multiple  data  (MIMD)  and  the  vector  processor  architectures. 

Two  prominent  MIMD  architectures  that  have  evolved  and  are  prominent  today 
are  the  distributed-memory  architecture,  and  the  shared-memory  architecture.  The 
distributed-memory  architecture  extends  the  von  Neumann  architecture  by  connecting 
single-instruction  single-data  (SISD)  machines  with  local  and  wide-area  networks.  This 
provides  an  alternative  to  larger  monolithic  computer  systems,  namely  a  system  of 
smaller  computers  networked  together.  To  the  degree  that  this  implies  the  need  for 
smaller  less  capable  processors  in  the  networked  system,  the  processor-memory 
imbalance  is  eased. 
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The  shared-memory  architecture  is  based  on  two  or  more  processes  sharing  a 
memory  address  space.  These  processors  may  be  centralized  or  distributed  physically  as 
indicated  in  Hennessy  [Ref  17].  Clearly  increasing  the  ratio  of  processors  to  a  memory 
further  increases  the  imbalance  between  processor  capabilities  and  the  corresponding 
demands  on  the  memory  system.  However,  this  architecture  provides  an  opportunity  to 
exploit  economies  of  scale  of  the  memory  system.  Further,  this  architecture  generates  an 
address  stream  that  is  a  composite  of  the  address  streams  generated  by  the  individual 
processors.  This  multiprocessor  address  stream  may  have  characteristics  that  are 
exploitable  for  improving  memory  performance. 

A  generic  vector  processor  architecture  is  shown  in  Figure  II.  1 .  The  vector 
processor  architecture  usually  consists  of  one  or  more  special  purpose  vector  processors 
serviced  by  a  memory  system.  A  vector  supercomputer,  such  as  the  Cray  Research  Y-MP 
is  typically  designed  with  vector  processors  and  also  contains  one  or  more  general- 
purpose  processors  that  can  operate  on  scalar  values.  But  as  the  name  suggests,  the 
processor  is  specially  designed  to  operate  upon  one  or  more  vectors.  Typically,  a  single 
processor  will  accept  two  vectors  and  generate  a  third  vector  as  an  output.  The  resulting 
output  vector  of  one  processor  may  serve  as  an  input  vector  to  a  second  processor.  This 
provides  for  high-level  pipelining  of  the  algorithm.  Since  the  operation  performed  by  a 
vector  processor  is  also  typically  pipelined,  a  new  piece  of  data  is  generally  required  for 
each  clock  cycle.  For  a  vector  processor  accepting  two  vectors  as  inputs  and  generating  a 
third  as  output,  three  memory  references  are  needed  each  cycle.  Further,  pipelining  the 
processor  allows  these  processors  to  operate  at  higher  clock  rates  than  normally  found  in 
computer  systems.  The  high  clock  rate,  the  need  for  a  data  element  from  each  vector  each 
clock  cycle,  and  the  existence  of  multiple  vectors  provides  a  substantial  load  on  the 
memory  system. 

The  preceding  discussion  suggests  that  there  are  many  computer  architectures  and 
that  the  features  of  the  particular  architecture  will  have  an  impact  on  the  memory 
requirements  and  design.  This  dissertation  will  focus  on  a  variation  of  the  vector 
processor  architecture.  This  architecture,  referred  to  as  the  butterfly  architecture,  will  be 
presented  in  the  next  chapter.  As  a  vector  processor,  it  has  many  of  the  properties 
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described  for  the  vector  processor  architecture  above.  It  differs  in  that  it  is  exclusively  a 
vector  machine  (i.e.,  it  does  not  perform  any  scalar  operations). 


Figure  II.l  Generic  Vector  Processor  Architecture 

The  remaining  portion  of  this  chapter  will  describe  in  some  detail  the  two  primary 
techniques  used  to  build  memory  systems.  These  techniques  are  known  as  cache  and 
interleaved  memory  systems.  Before  discussing  cache  and  interleaved  memory  system,  a 
brief  discussion  of  the  characteristics  of  a  memory  address  stream  will  be  presented. 
Those  characteristics  that  effect  the  memory  address  stream  will  also  be  addressed. 
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Much  of  the  material  presented  in  this  chapter  is  a  summary  of  ideas  that  can  be 
found  in  many  sources.  Cache  and  interleaved  memory  concepts  can  be  found  in  Stone 
[Ref  18],  Another  source  for  cache  memory  is  Hennessy  [Ref  19]. 

B.  MEMORY  ADDRESS  STREAM 

In  this  section,  the  characteristics  of  a  memory  address  stream  will  be  described 
for  a  general-purpose  processor  and  for  a  vector  processor.  Any  method  used  to  describe 
or  characterize  the  memory  address  stream  for  the  purpose  of  anticipating  future  memory 
references  is  referred  to  as  a  characteristic. 

The  first  factor  to  consider  that  effects  the  characteristic  of  a  memory  reference 
stream  is  the  type  of  processor  that  is  generating  the  address  stream.  Two  types  of 
processors  will  be  considered  here:  a  general-purpose  processor  and  a  vector  processor. 

First,  the  general-purpose  processor  will  be  considered.  A  much  utilized  example 
of  a  memory  reference  stream  characteristic  for  general-purpose  processors  is  locality  of 
reference.  It  has  been  postulated  and  confirmed  under  many  circumstances  that  addresses 
close  by  in  the  memory  space  to  the  most  recently  accessed  memory  address,  are  more 
likely  to  be  addressed  in  the  near  future  than  those  that  are  not  nearby.  Another  example 
of  a  general-purpose  processor  characteristic  is  that  instruction  fetches  have  a  tendency  to 
be  sequential  or  linear  (i.e.,  the  execution  of  a  set  of  instructions  that  do  not  contain 
branches  will  follow  one  after  the  other.) 

A  model  of  general-purpose  processing  architecture  (i.e.,  von  Neumann)  with  the 
three  basic  components  of  processor,  interconnect,  and  memory  is  shown  Figure  11.2. 

The  processor  establishes  the  de  facto  requirements  for  memory  accesses  for 
which  the  interconnect  and  memory  must  respond.  Said  in  another  way,  the  processor  is 
usually  thought  of  as  taking  the  active  role  generating  the  memory  reference  stream.  This 
is  accomplished  by  repeating  the  execution  cycle  consisting  minimally  of  a  fetch,  decode, 
and  execution  cycles.  The  two  types  of  memory  references  generated  by  the  processor  are 
instruction  fetches  and  data  read  or  write  references.  As  indicated  earlier,  instruction 
references  demonstrate  a  linearity  property  because  they  are  typically  segments  of 
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instructions  in  a  program  that  will  execute  without  branching.  Another  characteristic  of 
instruction  fetches  is  that  they  are  almost  always  read  only. 


Figure  11.2  General-Purpose  Processor 

The  address  stream  characteristics  are  influenced  by  the  following  factors: 

•  characteristics  of  the  processor, 

•  characteristics  of  the  software  development  tools,  and 

•  application  program  characteristics. 

Characteristics  of  the  processor  include  first  the  instruction  set  architecture  (ISA). 
The  majority  of  the  instructions  in  a  complex  set  instruction  computer  (CISC)  architecture 
such  as  the  Motorola  680X0  series  contain  memory  references  as  an  integral  part  of  the 
instruction  (i.e.,  one  or  two  memory  references  are  made  as  a  result  of  the  execution  of 
the  instmction.  This  is  in  contrast  to  a  reduced  instruction  set  computer  (RISC)  ISAs 
where  all  memory  references  are  accomplished  with  dedicated  memory  reference 
instructions.  The  ratio  of  instruction  references  to  data  references  is  generally  higher 
because  the  set  of  RISC  instructions  is  simpler  and  fewer  by  design.  Therefore,  the 
sequential  characteristic  is  more  pronounced. 

The  number  of  registers  available  to  the  processor  also  effects  the  memory  address 
stream.  The  more  registers  available,  the  more  variables  can  be  maintained  at  the 
processor  without  read  and  write  accesses  back  to  main  memory.  For  a  larger  number  of 
registers,  the  number  of  memory  references  will  decline  in  general,  and  the  ratio  of  the 
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number  of  instruction  references  to  data  references  will  increase.  Most  programs 
however  cannot  use  more  than  20  to  30  registers.  Thus,  the  more  recent  RISC  designs  are 
based  on  32  registers.  When  more  silicon  is  allocated  to  registers  such  as  with  the  Sun 
Sparc  design,  the  number  of  registers  for  one  context  is  limited  to  32  registers.  [Ref  20] 

Software  languages  and  their  corresponding  compilers  also  have  an  impact  on  the 
character  of  the  memory  reference  stream.  Extensive  use  of  looping  constructs  such  as 
the  WHILE  statement  provides  for  locality  of  reference  whenever  the  loop  executes 
multiple  times.  Also,  the  longer  the  loop,  the  greater  the  linearity  of  the  memory 
reference  stream.  The  programming  practice  of  modular  decomposition  and  the  use  of 
the  function  construct  also  yields  locality  of  reference.  Allocation  of  memory  for  data 
also  provides  some  locality  of  reference.  For  example,  in  the  C  programming  language, 
local  variables  of  a  function  are  stored  together.  Variable  passing  using  the  stack 
provides  some  locality  of  reference.  However,  dynamic  allocation  of  memory  is 
accomplished  from  a  data  structure  referred  to  as  a  heap.  Dynamic  allocation  can  result 
in  variable  references  to  be  spread  about  the  address  space  if  they  are  allocated  and 
deallocated  frequently. 

The  last  factor,  application  program  characteristics,  provides  the  biggest 
uncertainty  regarding  the  memory  reference  stream.  A  program  language  provides 
substantial  flexibility  regarding  the  implementation  of  a  program.  Given  the  particular 
design  decisions  of  any  memory  system,  it  is  possible  to  write  an  application  that  will 
exploit  the  weaknesses  of  the  memory  system. 

The  other  type  of  processor  that  will  be  discussed  is  the  vector  processor.  A 
vector  processor  accepts  one  or  more  vectors  as  input,  as  well  as  an  operation  or  function 
code,  that  specifies  the  function  to  be  performed  as  shown  in  Figure  II.3.  The  vector 
processor  will  accept  one  data  input  from  each  input  vector  on  each  clock  cycle.  Further, 
the  processor  will  perform  an  operation  repeatedly  on  a  finite  number  of  data  points.  For 
example,  if  the  operation  was  addition,  then  the  processor  would  add  each  pair  of  points 
from  two  vectors.  A  radix-4  operation  would  perform  the  butterfly  operation  on  data  sets 
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Figure  II.3  Vector  Processor 

An  important  aspect  of  the  vector  processor  is  that  the  operation  to  be  performed 
is  by  nature  repetitive  and  therefore  the  need  for  instruction  fetches  is  sparse  relative  to 
data  references.  Therefore,  for  practical  purposes,  the  instruction  fetches  may  be  ignored 
in  some  circumstances. 

The  data  reference  stream  has  several  important  properties.  First,  for  a  given 
vector  operation,  a  vector  is  either  an  input  or  an  output  for  all  data  points.  Therefore,  the 
memory  reference  stream  will  either  be  a  series  of  reads  or  writes  with  respect  to  the 
vector.  Further,  for  a  given  operation,  vectors  are  accessed  in  a  well  defined  path  and  not 
subject  to  run  time  decisions.  In  other  words,  the  memory  address  pattern  for  data 
references  for  a  vector  machine  are  primarily  determined  at  compile  time.  For  one  vector 
operation  and  the  associated  data,  a  vector  processor  could  generate  the  entire  memory 
reference  stream  prior  to  executing  the  first  instruction!  This  is  in  sharp  contrast  to  the 
general-purpose  computing  case  where  the  next  memory  reference  may  be  determined  by 
the  results  of  executing  the  current  instruction.  At  a  higher  level  of  program  control, 
there  may  be  conditional  branch  instructions  that  may  have  to  be  evaluated  before  a 
particular  vector  operation  can  be  executed. 


There  are  three  addressing  patterns  that  are  of  interest  for  the  butterfly  machine 
architecture  to  be  described  in  Chapter  0: 

•  constant  stride  s, 

•  constant  geometry  radix-r  butterfly,  and 

•  digit  reversal. 

The  most  common  addressing  patterns  used  in  vector  machines  are  patterns  of 
constant  stride  s,  where  5  is  the  spacing  between  the  references.  For  example,  a  vector 
multiply  of  two  vectors  would  require  a  constant  stride  of  one  for  each  input  vector  as 
well  as  the  output  vector. 

The  constant  geometry  radix-r  butterfly,  and  the  digit-reversed  pattern  are  both 
used  to  compute  FFTs.  The  constant  geometry  radix-r  butterfly  pattern  is  composed  of  a 
number  of  constant  stride  sequences.  One  pass  of  the  digit  reversal  pattern  is  required  for 
each  FFT.  A  discussion  of  these  memory  reference  patterns  can  be  found  in  Oppenheim 
[Ref  21]. 

C.  CACHE  MEMORY 

There  are  two  basic  memory  enhancement  techniques  that  have  been  developed  to 
minimize  the  impact  of  the  processor-memory  imbalance,  namely  cached  and  interleaved 
memory.  Cache  memory  is  by  far  the  most  pervasive  because  it  has  been  found  to  be 
effective  when  dealing  with  the  general-purpose  computer  architecture.  It  is  so  successful 
that  almost  any  computer  system  acquired  today  will  have  at  least  one  cache  in  the 
memory  system  and  frequently  more  than  one.  A  cache  memory  system  exploits  the 
locality  of  reference  property  described  in  the  section  above. 

Banked  interleaved  memory  has  been  used  in  a  general-purpose  architecture  as  a 
secondary  enhancement  technique  to  cache  memory.  However,  it  is  the  primary  means 
for  increasing  memory  bandwidth  for  vector  processor  architectures  such  as 
supercomputer  vector  processors. 

Figure  II.4  illustrates  the  physical  organization  of  a  cache  memory  system.  The 
cache  memory  is  a  small  memory  when  compared  with  the  main  memory,  but  operates  at 
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the  same  speed  as  the  processor.  It  logically  can  be  divided  into  a  cache  memory,  and  a 
cache  memory  controller.  From  the  processor’s  interface  looking  down,  the  cache  looks 
like  main  memory  where  the  memory  response  time  is  not  constant.  To  main  memory, 
the  cache  appears  to  be  a  bus  master  that  always  requests  a  block  of  memory  references  at 
a  time. 


Figure  II.4  Cache  Memory  System 


The  cache  memory  is  organized  into  equal  sized  blocks  referred  to  as  lines  or 
cache  lines.  Representative  sizes  for  cache  lines,  /,  range  between  16  and  64  bytes.  Main 
memory  is  logically  organized  into  blocks  of  length  /. 

Whenever  a  read-memory  reference  is  made  by  the  processor  to  the  cache,  the 
reference  is  either  contained  in  the  cache  or  it  is  not.  This  is  termed  a  cache  hit  or  miss 
respectively.  When  a  program  begins,  the  cache  is  empty  and  therefore  the  first  reference 
is  by  definition  a  miss.  Under  these  circumstances,  the  cache  must  obtain  the  memory 
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reference  from  main  memory.  The  cache  will  obtain  the  entire  line  associated  with  the 
reference,  store  the  line  in  a  cache  line,  and  pass  the  reference  back  to  the  processor.  The 
cache  is  then  ready  to  process  another  memory  request. 

On  the  next  memory  reference  request,  if  the  request  is  not  located  in  the  cache 
(i.e.,  if  the  request  is  not  located  within  the  cache  line  that  was  previously  loaded),  then 
the  process  repeats  as  before.  If  however,  the  reference  is  contained  in  the  cache,  then  the 
cache  simply  responds  with  the  data.  The  dedicated  interconnect  between  the  cache  and 
the  processor  will  generally  allow  the  data  access  to  proceed  at  the  processor  clock  rate. 

The  bandwidth  of  a  cache-based  memory  system  may  be  modeled  in  terms  of  the 
effective  cycle  time.  The  effective  cycle  time  Teff  is  defined  as  the  average  cycle  time  to 
access  one  word  when  filling  a  cache  line,  adjusted  for  the  number  of  elements  of  the  line 
not  actually  used  and  the  number  of  elements  of  a  line  used  more  than  once.  This  is  the 
effective  bandwidth  as  seen  by  the  processor.  The  effective  bandwidth  can  be  expressed 
as: 


T  =T 


I  I 
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where. 


(11.2) 


and 


Tc  is  the  time  required  to  fill  a  cache  line, 

I  is  the  number  of  words  in  a  cache  line, 

la  is  the  number  of  words  in  a  cache  line  that  are  accessed  by  the  processor,  and 

Ir  is  the  number  of  times  that  words  in  a  cache  line  are  accessed  by  the  processor 
after  the  first  access  (i.e.,  repeat  accesses). 

The  first  term  of  Equation  (II.  1),  Tea  represents  the  average  cycle  time  to  access 
one  word.  The  second  and  third  terms  reflect  adjustments  based  on  the  degree  that  the 
locality  of  reference  property  is  present.  The  second  term  is  the  fraction  of  the  words  in 
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the  line  actually  utilized  by  the  processor.  If  all  of  the  words  are  accessed,  then  this 
reduces  to  unity.  If  a  fraction  of  the  words  are  used,  then  the  average  cycle  time  is 
adjusted  by  the  reciprocal  of  that  fraction.  The  third  term  reflects  the  benefit  for  reusing  a 
word  without  having  to  fetch  it  back  from  main  memory.  If  no  words  are  reused,  the  term 
reduces  to  unity.  If  each  term  was  reused  once,  then  the  term  would  be  one  half. 

Equation  (II.  1)  illustrates  that  the  effective  bandwidth  provided  by  the  cache  can  vary 
substantially  in  either  direction  from  the  average  cycle  time  to  load  a  cache  line, 
depending  upon  the  locality  of  reference. 

Caches  are  classified  based  on  the  kind  of  information  that  is  to  be  cached.  There 
are  instruction  caches,  data  caches,  and  combined  caches.  An  instruction  cache  is  easier 
to  build  because  the  instruction  fetches  are  read-only  and  therefore  the  hardware 
necessary  for  maintaining  consistency  between  the  instructions  in  the  cache,  and 
instructions  in  main  memory  is  not  necessary.  Further,  if  it  is  determined  that  the  locality 
of  reference  is  different  for  instructions  and  data,  then  separate  data  and  instruction 
caches  can  be  better  tailored  to  their  respective  needs.  However,  a  combined  data  and 
instruction  cache  can  use  the  cache  resources  efficiently. 

A  short  discussion  on  the  time-varying  characteristics  of  cache  is  in  order  before 
leaving  the  discussion  of  cache  memory.  When  a  process  begins,  none  of  the  process’s 
instructions  or  data  is  contained  in  the  cache.  Most  of  the  initial  references  are  misses, 
but  as  the  process  progresses,  more  and  more  of  the  program  and  data  necessary  for  the 
process  are  loaded  into  the  cache.  The  contents  of  the  cache  are  “demand  driven”  by  the 
processor’s  references.  A  point  in  time  is  reached  where  almost  all  of  the  address 
references  are  in  the  cache  and  therefore  most  references  are  cache  hits.  This  assumes 
that  the  cache  has  sufficient  capacity  to  support  the  process.  The  period  of  time  between 
the  start  of  the  process,  until  the  process  is  mostly  cache  hits  is  referred  to  as  the  transient 
time.  That  period  of  time  beginning  with  mostly  cache  hits  is  referred  to  as  steady-state 
time.  When  a  process  transitions  to  another  part  of  the  program,  or  when  another 
process’s  context  is  switched  in,  then  another  transient  is  experienced  followed  by  a 
steady-state  period. 
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A  process’s  memory  address  space  as  well  as  the  cache  lines  can  be  illustrated 
graphically.  The  area  of  the  memory  address  space  that  is  contained  in  the  cache  or 
alternatively,  those  cache  lines  that  contain  the  process’s  references  at  the  point  of  steady 
state,  is  referred  to  as  the  process’s  foot  print.  If  such  a  graphic  were  available  and 
updated  in  real  time,  it  would  show  at  any  instant  that  portion  of  the  memory  space  that  is 
active.  The  time-varying  dynamics  of  the  memory  address  space  and  the  cache  would  be 
viewed  through  continually  updated  graphics  throughout  the  life  cycle  of  the  processes. 
[Ref  22] 

D.  INTERLEAVED  MEMORY 

The  other  memory  management  technique  to  be  described  is  interleaved  memory. 
A  block  diagram  of  a  banked  interleaved  memory  is  shown  in  Figure  II.5.  Memory 
devices  (e.g.,  DRAMs)  are  mapped  into  the  address  space  such  that  the  memory  address 
space  is  partitioned  evenly  among  the  banks.  The  primary  parameters  that  define  an 
interleaved  memory  scheme  include  the  number  of  banks  and  the  scheme  for  mapping 
memory  addresses  to  a  bank  number  and  index  within  a  bank  pair.  This  will  be  referred 
to  as  the  bank  number  decoding  and  bank  index  decoding  schemes  respectively.  In 
general  it  is  desirable  to  have  a  large  number  of  banks  since  the  potential  data  rate  is 
greater.  Electrical  properties  such  as  fanout  suggest  a  cost  associated  with  more  banks 
and  therefore  a  cost  benefit  tradeoff  must  be  evaluated  for  a  given  application.  As  is 
indicated  in  the  discussion  below,  the  bank  number  selection  criteria  also  may  have  an 
impact  on  the  number  of  banks  chosen. 

The  following  is  a  brief  description  of  the  operation  of  a  banked  memory  system 
that  incorporates  interleaving  to  increase  memory  performance.  A  bank  will  accept  a 
memory  request  if  the  bank  is  not  processing  a  previous  memory  request.  Therefore,  as 
many  as  k  memory  requests  can  be  pending  at  a  time  (i.e.,  one  from  each  bank).  If  a  busy 
bank  is  selected  (i.e.,  if  it  is  processing  a  previous  memory  request)  then  the  memory 
system  stalls.  A  memory  system  is  said  to  stall  when  the  current  memory  request  is  not 
accepted.  No  other  memory  requests  will  be  allowed  until  the  selected  bank  has 
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completed  the  current  memory  request.  The  stalled  memory  request  is  then  accepted  and 
the  process  continues. 


If  each  of  the  banks  can  be  kept  busy,  then  a  total  memory  bandwidth  of  kB  words 
per  second  can  be  obtained  from  the  memory  system  where  k  is  the  number  of  banks  and 
B  is  the  bandwidth  of  a  single  bank.  In  order  for  this  to  occur,  all  banks  must  be 
continuously  processing  memory  access  requests  and  the  memory  ratio  must  be  less  than 
or  equal  to  the  number  of  banks.  This  is  accomplished  for  example,  if  the  banks  are 
selected  in  a  round  robin  fashion  (e.g.,  0,  1,  2,  3  ...  k-2Jc-\,  0,  1,2, ...  k-2,  k-\,  0,  1  ... 
where  k  is  the  number  of  banks).  The  effectiveness  of  interleaved  memory  is  then 
directly  related  to  the  ability  to  keep  the  banks  busy  which  is  accomplished  by  providing  a 
work  distribution  that  is  approximately  uniform  over  time. 


The  three  primary  performance  measurements  of  interest  for  interleaved  memory 
systems  in  this  effort  are: 


•  latency  (L), 


•  throughput  (TP),  and 
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•  speedup  (5). 

Latency  (L)  is  defined  as  the  number  of  memory  cycles  from  the  time  a  processor 
attempts  to  issue  a  memory  reference  request,  until  the  time  the  request  is  completed. 
Note  that  latency  contains  two  basic  components.  First,  latency  occurs  due  to  the  delay  in 
the  memory  bank  necessary  to  service  a  memory  request.  Second,  latency  will  increase  if 
the  memory  system  is  saturated  and  therefore  the  memory  system  does  not  accept 
additional  memory  references. 

Latency  is  also  time-varying,  and  can  be  measured  at  each  point  a  memory 
response  is  completed.  This  time-varying  view  of  latency  can  also  be  depicted 
graphically.  Scalar  measures  of  latency  include  maximum  latency  (Lmax),  average  latency 
(Lavg),  and  the  standard  deviation  of  the  latency  (Lstd)- 

The  memory  ratio  {MR),  introduced  in  the  cache  memory  section,  is  directly 
related  to  latency  in  interleaved  memory  systems.  The  minimum  latency  for  an 
interleaved  memory  system  is  the  memory  ratio  plus  any  overhead  related  to  the 
interleaved  memory  system.  Interleaved  memory  systems  generally  use  registers  to 
receive  an  input  and  for  placing  data  onto  the  bus  for  read  requests.  This  adds  two  cycles 
to  the  minimal  latency  and  therefore  the  minimum  latency  for  an  interleaved  memory 
system  will  be 

4,,=M/?  +  2.  (11.3) 


The  throughput  is  defined  to  be  the  ratio  of  the  total  number  of  memory  cycles 
required  for  an  ideal  memory  device  to  complete  a  set  of  memory  references,  to  the  actual 
number  of  memory  cycles  used  to  complete  the  set  of  memory  references  for  a  particular 
memory  design.  An  ideal  memory  device  is  defined  as  one  that  can  service  a  memory 
reference  in  one  cycle.  Throughput  may  be  expressed  as: 


TP  = 


C. 


ideal 


''actual 


(11.4) 


where: 
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Cideai  is  the  total  number  of  memory  cycles  for  a  given  task,  for  an  ideal  memory 

device.  This  is  equivalent  to  total  number  of  memory  references. 

Cactuai  is  the  actual  number  of  memory  cycles  necessary  to  complete  the  same  task 

and, 

The  memory  design  can  never  be  better  than  the  ideal  memory  device,  therefore 

Q<TP<\.  (11.5) 

Throughput  is  a  measure  of  how  well  the  processor  is  serviced  by  the  memory 
system.  In  this  analysis,  it  is  assumed  unless  otherwise  stated,  that  the  processor  will 
issue  one  memory  access  per  system  clock  cycle  unless  the  memory  system  blocks  the 
request.  Therefore,  a  throughput  of  1 .0  indicates  that  the  processor  is  provided  one 
“memory  response”  for  each  clock  cycle. 

Another  measure  of  throughput  is  the  steady-state  throughput.  In  a  manner 
analogous  to  cache  memories,  there  is  a  period  of  time  when  the  memory  goes  through  a 
transitory  period  which  is  reflected  in  an  irregular  output.  This  is  followed  by  a  period 
where  is  output  is  periodic  (e.g.,  frequently  a  constant).  These  two  periods  will  be 
referred  to  as  the  transient  and  the  steady-state  response  of  the  interleaved  memory 
system.  Of  particular  interest  are: 

•  The  length  of  the  transient  response  {T,^.  It  is  desirable  that  this  figure 
approach  the  minimum  latency,  and  that  it  be  a  small  fraction  of  the  length  of 
the  vector  processed. 

•  Steady-state  throughput  (rP„).  The  steady-state  throughput  is  a  better 
measure  of  throughput  because  it  eliminates  the  effects  of  the  transient. 
However,  this  measurement  is  only  valid  when  the  transient  response  is  a 
small  fraction  of  the  vector  length  as  indicated  above. 

Speedup  is  a  performance  measure  that  focuses  on  the  relative  improvement 
gained  when  adding  additional  memory  components.  Speedup  (5)  is  defined  as  the  ratio 
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of  number  of  memory  cycles  necessary  to  complete  a  given  task  using  one  memory  bank 
to  the  number  of  memory  cycles  necessary  to  complete  the  same  task  using  k  banks  or 


where: 


(II.6) 


Cl  is  the  number  of  memory  cycles  required  for  one  bank,  and 
Ck  is  the  number  of  memory  cycles  required  for  k  banks. 

Note  that  Ci  is  the  product  of  the  number  of  memory  references,  Cideau  and  the 
memory  ratio  {MR).  It  can  be  seen  that  the  relationship  between  throughput  and  speedup 
is  a  mutiplicative  factor  of  the  memory  ratio  as  shown  below: 


_  ^ideal 


MR 


MR. 


(11.7) 


Both  throughput  and  speedup  are  performance  measures  of  memory  bandwidth. 
Both  will  be  viewed  as  a  scalar  measure  of  performance  as  defined  by  the  formulas 
above.  Throughput  may  also  be  defined  as  a  moving  average,  capturing  the  time-varying 
quantity  of  throughput.  This  can  be  illustrated  as  a  line  graph. 

One  characteristic  of  bulk  memory  that  has  motivated  the  use  of  banked 
interleaved  memory  is  the  difference  between  the  memory  access  time  (t^)  and  the  cycle 
time  (C).  Many  devices  (e.g.,  DRAMs)  have  a  cycle  time  that  is  greater  than  the  access 
time  because  of  overhead  tasks  that  must  be  completed  prior  to  beginning  another  access. 
For  example,  a  read  operation  to  a  DRAM  memory  cell  destroys  the  contents  of  the  cell. 
The  original  contents  must  be  written  back  to  the  cell  to  preserve  the  value.  The 
relationship  between  the  access  and  cycle  times  can  be  expressed  as 

t^<k-t^  k  =  2,3,4...  (IL8) 


If  the  number  of  banks  is  selected  such  that 


where  B  is  the  number  of  banks,  then  the  overhead  time  can  be  absorbed  if  all  banks  can 
be  kept  busy.  The  memory  can  then  operate  at  the  access  rate  rather  than  the  cycle  rate. 

Although  banked  interleaved  memory  has  been  used  to  enhance  a  cached  memory 
scheme,  the  number  of  banks  is  small  typically  two  or  four.  Generally,  an  interleaved 
memory  with  a  large  number  of  banks  is  used  for  vector  processing.  One  category  of 
vector  processing  is  supercomputer  vector  machines  such  as  the  Cray  I,  the  Burroughs 
Scientific  Processor  (BSP),  and  the  Convex  C3800.  Another  category  is  the  attached 
vector  processor.  An  example  of  an  attached  vector  processor  is  Floating-Point  System’s 
FPS-164  [Ref  23]. 

Two  important  characteristics  of  vector  processors  include  the  ability  to  perform 
scalar  operations  and  a  memory  system  that  is  hierarchical.  The  need  for  scalar 
operations  is  clear  in  a  supercomputer  where  a  relatively  large  computational  problem,  is 
expected  to  be  solved  without  additional  computer  support. 

The  Cray  I,  illustrated  in  Figure  11.6,  is  an  example  of  a  supercomputer  with  a 
memory  hierarchy,  and  the  ability  to  perform  scalar  as  well  as  vector  operations.  There  is 
one  main  memory  which  provides  for  the  majority  of  storage.  The  fastest  memories  are 
connected  directly  to  the  pipelined  arithmetic  units.  The  Arithmetic  Units  perform  scalar 
as  well  as  vector  operations.  By  placing  the  highest  speed  memories  next  to  the 
processor,  the  processor  can  operate  at  an  optimal  speed  so  long  as  the  data  is  contained 
in  the  high-speed  memories. 

This  is  valid  when  the  algorithm  can  be  written  in  such  a  way  that  data  is 
repeatedly  accessed  before  returning  to  main  memory.  Alternatively,  it  can  be  said  that 
the  data  has  locality  of  reference  at  a  high  level  of  granularity.  However,  in  this  instance, 
the  programmer  is  responsible  for  managing  all  of  the  levels  of  memory  (i.e.,  it  is  not 
accomplished  automatically  as  was  done  with  cached  memory  in  the  general-purpose 
computing  case). 
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When  an  attached  processor  is  used  in  conjunction  with  a  personal  computer  or 
workstation,  it  is  reasonable  to  expect  that  the  workload  can  he  partitioned  such  that  a 
part  of  the  algorithm  will  be  executed  on  the  host’s  general-purpose  processor,  the 
remainder  on  the  vector  processor.  In  order  to  reduce  complexity  and  cost,  the  vector 
machine  may  be  optimized  to  perform  vector  operations  and  therefore  mitigate  the  need 
for  scalar  operations  on  the  vector  processor.  The  lack  of  scalar  operations  will  in  turn, 
reduce  the  likelihood  of  repeated  use  of  data  before  returning  it  to  the  main  memory. 

This  in  turn  suggests  that  memory  schemes  without  a  hierarchy  may  be  appropriate  for  an 
attached  vector  processor. 


Figure  II.6  Cray  I  Memory  Hierarchy  [Ref  24] 


Interleaved  memory  systems  are  also  designed  to  take  advantage  of  a 
characteristic  of  the  memory  reference  stream.  Therefore,  as  was  the  case  for  cache 
memory  systems,  the  performance  of  the  interleaved  memory  system  is  highly  dependent 
on  the  particular  program  executed. 
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For  example,  it  has  been  observed  that  a  purely  random  addressing  pattern  has  a 
speedup  that  can  be  expressed  as 


ai.io) 


where  B  is  the  number  of  banks  [Ref  25].  This  is  a  disappointing  result,  given  that  the 
bandwidth  is  proportional  to  the  square  root  of  the  number  of  banks.  This  result,  coupled 
with  the  large  values  for  latency,  has  discouraged  the  use  of  interleaved  memories  with  a 
large  number  of  banks  in  general-purpose  computing. 

However,  the  memory  reference  pattern  based  on  accesses  to  vectors  is  quite 
different  than  a  memory  reference  pattern  generated  by  a  general-purpose  computer. 
These  patterns  are  deterministic  and  they  are  characterized  as  having  patterns  with 
constant  stride.  Operations  such  as  vector  addition  and  multiplication  have  a  constant 
stride  of  one.  Other  operations  have  constant  strides  other  than  one.  More  complex 
address  patterns  are  found  with  operations  such  as  a  radix-r  butterfly  and  digit  reversal. 
However,  these  more  complex  patterns  have  multiple  series  of  constant  stride  embedded 
in  the  address  pattern.  A  model  for  memory  address  patterns,  as  they  related  memory 
performance,  is  presented  in  Chapter  V. 

Several  memory  decoding  schemes  will  be  described  below.  Memory  decoding 
for  banked  interleaved  memory  systems  includes  determination  of  the  bank  number  and 
the  index  within  a  bank  Frequently,  the  index  within  a  bank  is  accomplished  in  a  straight 
forward  manner  using  a  subset  of  the  address  bits.  The  primary  focus  of  the  discussion 
below  will  be  in  the  selection  of  a  bank  number.  The  motivation  for  the  different  bank 
selection  schemes  is  to  find  a  scheme  that  will  spread  memory  references  evenly  to  all  of 
the  banks  (i.e.,  in  a  round  robin  pattern),  for  the  memory  address  patterns  most  likely  to 
occur.  It  is  also  desired  that  the  bank  selection  scheme  have  the  following  properties: 

•  an  implementation  that  is  inexpensive  in  terms  of  hardware 

•  have  a  small  propagation  delay,  and 
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•  imposes  the  fewest  restrictions  on  the  number  of  banks. 

The  simplest  decoding  or  interleaving  scheme  of  the  memory  space  uses  the  least 
significant  bits  to  directly  select  a  bank,  and  the  remaining  higher  order  bits  to  select  a 
word  from  the  selected  bank.  This  will  be  referred  to  as  the  conventional  decoding  or 
interleaving  scheme.  Conventional  decoding  has  no  implementation  requirements  but 
requires  the  number  of  banks  to  be  a  power  of  two.  From  a  performance  perspective,  an 
address  pattern  with  a  constant  stride  of  one  will  result  in  a  round  robin  selection  of  the 
banks  for  optimal  utilization  of  the  banks  when  the  banks  are  decoded  using  the 
conventional  scheme.  However,  only  a  subset  of  the  banks  will  be  selected  whenever  the 
stride  is  not  relatively  prime  to  the  number  of  banks.  Specifically,  for  a  given  stride  s,  the 
number  of  banks  that  will  be  selected  is: 


^eff= - - - 

^  gcd(S,  s) 


(11.11) 


where 


B  is  the  total  number  of  banks  in  the  memory  system, 
s  is  the  stride  of  a  constant-stride  address  pattern, 

Bejf  is  the  effective  number  of  banks.  By  effective,  it  is  meant  that  an  effective 
bank  is  one  that  is  actually  given  memory  references  for  the  specified  address 
pattern. 

gcd(a,b)  is  the  greatest  common  divisor  for  a  and  b. 

A  bank  that  is  referenced  for  a  given  addressing  pattern  is  referred  to  as  an 
effective  bank.  For  example  if  the  stride  equals  the  number  of  banks  (or  a  multiple  of  the 
number  of  banks)  a  single  bank  will  receive  all  of  the  memory  requests  regardless  of  the 
number  of  banks  in  the  memory  system.  The  effect  on  the  number  of  banks,  stride,  and 
bank  selection  criteria  will  be  described  in  detail  in  Chapter  V.  Given  the  problems  noted 
above  with  strides  that  are  not  relatively  prime  to  the  number  of  banks,  coupled  with  the 
fact  that  many  algorithms  such  as  the  fast  Fourier  transform  frequently  use  powers  of  two 
strides,  other  bank  selection  criteria  have  been  investigated. 
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Linear  data  skewing  schemes  have  been  proposed  where  the  memory  bank 
selection  for  a  data  element  contained  in  an  array  at  location  row  column  indices  i,j  is 
mapped  to  bank  ipi  +  jp2.  This  method  is  hampered  by  the  need  for  arithmetic  operations 
to  compute  the  bank  number.  The  hardware  is  relatively  more  complex,  but  more 
importantly  the  time  needed  to  compute  the  bank  number  has  a  negative  impact  on 
memory  performance.  However,  one  data  skewing  scheme  referred  to  as  1-Skew,  has  an 
implementation  that  requires  only  logic  operators.  For  an  address  i,  the  bank  number  is 
computed  as: 


B,= 


I  + 


I 

B 


mods, 


(11.12) 


where 


Bi  is  the  computed  bank  number, 
i  is  the  memory  address, 
mod  is  the  modulus  operator,  and 
B  is  the  number  of  banks. 

Note  that  the  division  and  the  modulo  operations  are  trivial  when  5  is  a  power  of 
two.  This  leaves  only  the  addition  operation  to  perform. 

Considering  Equation  (H.  1 1 ),  it  can  be  seen  that  if  the  number  of  banks  in  a 
system  is  a  prime  number,  then  the  number  of  effective  banks  would  always  be  equal  to 
the  number  of  banks  except  when  the  stride  is  equal  to  a  multiple  of  the  number  of  banks. 
The  biggest  problem  with  using  a  prime  number  of  banks  is  that  a  direct  implementation 
of  such  a  scheme  is  requires  arithmetic  operations  that  are  expensive  and  incur  more 
propagation  delay  than  is  tolerable  for  performance.  Several  techniques  have  been 
proposed  to  mitigate  this  problem.  In  general  the  following  equations  are  used  to 
compute  the  bank  number  B,  and  index  into  the  bank  I : 


5,  =  i  mod  B, 


(11.13) 


(11.14) 
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where 

i  is  the  address,  and 
B  is  the  number  of  banks. 

The  Burroughs  Scientific  Processor  used  a  scheme  that  reduced  the  complexity  of 
bank  selection  to  a  single  adder  plus  logic  operators.  This  approach  to  memory  selection 
logic  is  accomplished  in  part  by  selecting  a  smaller  value  of  B  in  Equation  (11.14)  than  in 
Equation  (11.13).  For  a  smaller  value  of  3=5',  this  results  in  the  loss  of 

B-B' 

B 

of  the  memory.  [Ref  26] 

An  alternative  method  pipelines  the  computation  of  the  bank  selection  [Ref  27]. 
This  approach  is  dependent  on  a  constant-stride  address  pattern.  The  proposed 
architecture  described  in  Chapter  0  requires  addressing  patterns  that  are  not  strictly 
constant  stride. 

The  last  bank  selection  technique  to  be  reviewed  is  permutation-based 
interleaving  [Ref  28].  Permutation-based  interleaving  is  based  on  the  same  principles  as 
Hamming  error  detection  and  correction  codes.  The  bank  number  M  is  calculated  using 
the  matrix  equation: 

b  =  P  a,  (11.15) 

where 

b  is  a  (fc  X  1)  column  vector  representing  the  bank  number, 

a  is  a  (r  X  1)  column  vector  representing  some  number  (possibly  all)  of  the  bits  of 
the  memory  address,  and 

P  is  a  (A:  X  r)  matrix  that  specifies  the  bank  number  mapping. 

The  matrix  multiplication  indicated  in  the  equation  is  similar  to  matrix 
multiplication  except  that  the  multiplication  operations  are  logical  ANDs  and  the 
summation  of  product  terms  is  a  logical  exclusive  OR.  Note  that  this  scheme  requires 
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only  logical  operations  and  therefore  can  be  implemented  with  little  propagation  delay. 
Further,  a  wide  array  of  mappings  can  be  specified.  Note  that  the  conventional  bank 
decoding  scheme  is  a  subset  of  permutation-based  scheme  when  the  P  matrix  is  an 
identity  matrix  or  rank  equal  to  the  number  of  bits  in  the  bank  number.  Bank  selection 
using  permutation-based  techniques  are  explored  further  in  Chapter  0  Section  E. 

It  has  been  shown  that  for  constant-stride  address  patterns  where  the  stride  ranges 
from  two  through  64,  Skew-1  and  Permutation-Based  bank  selection  schemes  have 
substantially  better  performance  than  conventional  bank  decoding.  Permutation-Based 
bank  selection  has  slightly  better  performance  than  Skew-1.  The  performance 
measurement  in  this  study  was  throughput.  [Ref  29] 

An  enhancement  to  interleaved  memory  architecture  is  the  use  of  input  and  output 
buffers  for  each  memory  bank.  An  interleaved  memory  model  with  input  and  output 
buffers  is  illustrated  in  Figure  II.7.  An  input  buffer  of  length  is  a  first-in  first-out 
(FIFO)  queue  that  allow  a  memory  bank  to  accept  bin  memory  requests  (i.e.,  a  memory 
bank  can  accept  memory  requests  without  completing  a  memory  request  that  is  currently 
in  progress.  A  standard  interleaved  memory  architecture  is  deHned  in  this  work  to  be 
one  that  has  one  input  and  output  buffer.  A  memory  system  with  more  than  one  butter 
is  referred  to  as  a  STM  memory. 

Buffers  are  useful  for  smoothing  out  irregularities  in  the  memory  address  pattern. 
They  do  not  improve  throughput  if  there  is  an  insufficient  number  of  effective  banks.  To 
illustrate,  consider  a  standard  interleaved  memory  system  with  eight  banks  and  an 
effective  memory  ratio  of  eight.  For  a  stride  of  two,  one  possible  bank  selection  pattern  is 

{0,2,4,6,0,2,4,6,...}. 

There  are  four  effective  banks  and  the  resulting  throughput  will  be  0.5.  Adding 
buffers  will  not  improve  throughput  regardless  of  the  number  of  buffers  added  since  the 
four  banks  that  are  in  use  are  effectively  used  100  percent  of  the  time. 
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Figure  IL7  Interleaved  Memory  With  Queues  [Ref  30] 
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Compare  this  to  the  situation  where  the  same  interleaved  memory  system  is 
presented  an  address  pattern  such  that  the  bank  selection  pattern  is 


{ 0,0, 1 , 1 ,2,2,3,3,4,4,5,5,6,6,7,7,0,0, 1,1,...}. 


The  first  memory  reference  will  be  accepted  followed  by  a  stall  because  bank  0  is 
busy  with  the  first  request.  Once  the  first  request  is  completed,  the  second  memory 
reference  is  accepted  by  bank  0.  On  the  next  cycle,  the  third  memory  reference  is 
accepted  followed  by  a  stall.  This  pattern  continues  until  the  vector  is  processed. 


Now  consider  the  case  where  each  memory  bank  can  accept  two  memory  requests 
(i.e.,  the  bank  can  be  processing  one  memory  request,  and  accept  another.  In  this  case, 
each  memory  bank  will  accept  the  two  memory  requests  on  the  first  cycle.  Since  16 
cycles  will  have  passed  between  the  time  bank  0  received  the  first  memory  reference,  and 
the  time  that  the  second  cycle  begins,  each  bank  will  have  sufficient  time  to  process  the 
memory  requests  as  they  are  presented.  The  throughput  will  be  optimal  in  the  steady 
state.  Observe  that  this  use  of  buffers  will  increase  latency. 


The  Split  Transaction  Memory  (STM),  described  in  Chapter  IV,  incorporates  the 
concept  of  buffers  in  interleaved  memory.  A  high-level  view  of  STM  is  shown  in  Figure 
n.8.  Each  memory  bank  consists  of  three  components: 


•  The  bulk  storage  module  is  the  device  that  provides  for  data  storage.  The 
bulk  storage  module  contains  one  or  more  chips  (DRAMs  with  current 
technology)  and  the  refresh  circuitry. 

•  The  cache  elements  are  high-speed  memory  that  serves  as  an  intermediate 
staging  area  for  data  requests  from  the  processor,  and  memory  responses  from 
the  bulk  storage  module. 

•  Controllers  for  the  interfaces  between  the  bulk  storage  module  and  the  cache 
elements,  and  the  interface  between  the  cache  elements  and  the  memory  bus. 

On  each  cycle,  a  STM  module  may  perform  none  or  all  of  the  following 
operations: 
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•  Accept  a  memory  request  from  the  processor.  A  memory  request  is  accepted 
if  the  bank  is  selected  and  if  the  bank  is  not  full  (i.e.,  there  is  room  in  the 
cache  element  for  another  request).  Accepting  a  read  or  a  write  request 
requires  that  the  address  or  address  and  data  be  stored  in  a  cache  element 
respectively. 

•  Manage  the  bulk  storage  module.  This  includes  issuing  memory  requests  to 
the  bulk  storage  module  and  accepting  data  from  a  memory  read  request. 

•  Placing  data  on  the  bus  in  response  to  a  memory  read. 

The  cache  elements  are  managed  as  a  circular  queue  with  three  indices.  The  first 
is  used  for  marking  the  next  free  cache  element  available  for  accepting  new  memory 
requests.  The  second  index  is  used  to  track  which  memory  request  should  be  processed 
by  the  bulk  storage  module.  The  last  index  points  to  data  associated  with  a  processed 
memory  read  output. 

The  key  difference  in  buffers  and  cache  elements  is  the  organization  and  use  of 
registers.  As  illustrated  in  Figure  n.9,  both  buffers  and  cache  elements  are  used  to 
facilitate  the  transfer  of  data  between  the  bulk  store  in  a  bank  and  the  bus.  However,  a 
buffer  pair  uses  one  data  register  for  the  input  buffer  and  another  data  register  for  the 
output  register.  A  single  buffer  provides  pipelining  of  memory  requests  since  a  new 
memory  request  can  be  placed  in  the  input  register  in  parallel  with  the  bulk  store 
providing  a  memory  response  in  the  output  register.  On  the  other  hand,  a  cache  element 
only  has  one  data  register  and  therefore  two  cache  elements  would  be  required  to  provide 
pipelining  as  described  above  for  a  single  buffer. 

A  comparison  of  the  representative  storage  requirements  for  buffers  and  cache  elements  is 
also  shown  in  Figure  n.9.  Both  schemes  must  store  the  address  and  provide 
administrative  data  to  maintain  the  sequential  ordering  of  the  memory  requests.  Two 
indices  are  needed  when  maintaining  an  input  and  output  buffer  scheme  as  shown  in 
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Figure  II.8  Split  Transaction  Memory  Overview 
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Figure  n.7.  For  purposes  of  comparison,  it  is  assumed  that  addresses  and  data  are  stored 
as  four  byte  words  and  the  indices  and  other  control  data  is  contained  in  one  byte. 
Implementation  of  standard  interleaving  (i.e.  using  a  single  buffer)  is  more  efficient  using 
the  buffer  organization  since  one  buffer  requires  14  bytes.  To  obtain  the  equivalent 
pipelining  with  cache  elements  requires  two  cache  elements  and  therefore  18  bytes  of 
storage.  However,  if  a  design  calls  for  k  levels  of  buffering,  the  cache  element  scheme 
becomes  more  efficient  for  even  small  values  of  k.  In  general,  ^+1  cache  elements  are 
required  to  obtain  the  equivalent  level  of  pipelining  with  k  buffers.  For  example  for  k=2, 
28  bytes  are  needed  for  the  buffer  scheme  versus  27  bytes  for  the  cache  element  scheme. 
The  number  of  cache  elements  needed  for  a  memory  system  is  explored  in  detail  in 
Chapters  V  and  VI. 

Extensive  research  has  been  conducted  in  the  area  of  interleaved  memories.  One 
focus  of  this  research  has  been  the  nature  of  the  address  stream.  Early  work  includes 
Hellerman  [Ref  31]  which  is  based  on  a  random  address  stream.  Later  efforts  include 
Chang  [Ref  32]  and  Rau  [Ref  33]  which  provide  several  dependency  models  of  the  data. 
Several  studies  have  proposed  architectural  enhancements  such  as  the  separation  of 
instruction  and  data  accesses  to  the  memory  system  Coffman  [Ref  34].  Burnett  [Ref  35] 
and  Dbois  [Ref  36],  and  Sohi  [Ref  37]  have  investigated  different  uses  of  buffers. 
Multiprocessor  structures  are  analyzed  in  Baskett  [Ref  38]  and  Briggs  [Ref  39].  Fault 
tolerance  is  described  in  the  context  of  interleaved  memories  in  Cheung  [Ref  40]. 

In  the  next  chapter,  an  architecture  for  an  attached  vector  processor  designed  to 
compute  spectral  correlation  functions  will  be  described.  The  need  for  an  efficient  low- 
cost  memory  system  will  become  clear  as  this  design  is  described. 
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Queue  Storage  Registers: 
2  Data  (8  bytes) 

1  Address  (4  bytes) 

2  Index  (2  bytes) 

Total:  14  bytes 


a)  One  Buffer 


Cache  Element  Storage  Registers: 
1  Data  (4  bytes) 

1  Address  (4  bytes) 

1  Index  (1  bytes) 

Total:  9  bytes 


i 


■H  Cache  Element 


Bulk  Storage 
Module 


t 

b)  One  Cache  Element 


Figure  11.9  Comparison  of  Buffers  Versus  Cache  Elements 
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III.  BUTTERFLY  MACHINE  ARCHITECTURE 


A.  INTRODUCTION 

The  following  is  a  description  of  the  butterfly  machine  architecture.  Much  of  the 
material  is  summarized  from  previous  work  in  Loomis  [Ref  41]  and  Bernstein  [Ref  42], 
The  butterfly  machine  architecture  was  developed  to  provide  a  high-performance,  low- 
cost  solution  for  cyclostationary  processing  in  particular,  and  other  digital  signal 
processing  algorithms  that  lend  themselves  to  vector  operations  in  general.  The  butterfly 
machine  is  designed  to  perform  only  vector  processing  (i.e.,  no  scalar  operations).  To 
obtain  high  performance,  the  objective  is  to  approach  vector  computations  with  no  stalls 
in  the  pipeline. 

In  the  following  discussion,  the  term  radix-r  is  used  as  a  parameter  of  the  fast 
Fourier  transform  (FFT)  algorithm  as  described  by  Oppenheim  in  [Ref  43]  and  not  to  be 
confused  with  the  floating  point  representation  of  the  hardware.  The  value  of  r  indicates 
the  number  of  inputs  and  outputs  generated  with  a  single  butterfly  operation.  The  floating 
point  representation  is  not  discussed  but  assumed  to  be  32  bit  IEEE-754  format. 

VLSI  technology  has  made  it  possible  to  develop  specialized  digital  signal 
processing  (DSP)  chips  that  perform  FFT  butterfly  operations  for  a  variety  of  radices  in 
real  time  with  some  latency.  When  relatively  high  radices  are  used  compared  to  radix-2, 
FFTs  can  be  computed  at  substantially  faster  rates  than  are  possible  with  traditional 
processors.  These  processors  are  also  well  suited  for  performing  vector  operations  on 
data.  A  computer  architecture  composed  of  such  DSP  chips  can  compute  the  vast 
majority  of  operations  required  for  cyclostationary  algorithms.  An  architecture  is 
proposed  that  takes  advantage  of  these  specialized  DSP  chips  (referred  to  as  butterfly 
machines  (BFMs).  An  architecture  using  one  BFM  is  defined  and  is  referred  to  as  the 
one-chip  architecture.  An  implementation  of  the  cyclostationary  algorithm.  Strip  Spectral 
Correlation  Algorithm(SSCA)  is  illustrated  using  the  one-chip  architecture.  The  one-chip 
architecture  is  then  expanded  into  a  parallel  architecture.  Examples  of  this  type  of  chip 
technology  can  be  found  in  the  literature  [Ref  44],  [Ref  45],  [Ref  46],  [Ref  47]. 
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A  block  diagram  of  the  processing  environment  for  the  butterfly  machine  is 
shown  in  Figure  in.  1 .  The  host  computer  is  responsible  for  basic  process  coordination 
and  scalar  operations.  The  butterfly  machine  can  operate  in  two  modes.  In  the  first 
mode,  the  butterfly  machine  waits  for  requests  from  the  host  computer.  When  a  request  is 
received  from  the  host  computer,  the  butterfly  machine  responds  by  accepting  data, 
processing  the  data,  and  then  sending  the  processed  data  back  to  the  host.  The  input  data 
can  come  from  either  an  external  data  channel  referred  to  in  Figure  IH.l  as  the  input  data 
channel,  or  from  the  host  computer  via  the  system  bus. 


Figure  III.l  Butterfly  Machine  Environment 


In  the  second  mode,  the  butterfly  machine  performs  a  function  on  a  stream  of  data, 
sent  to  the  butterfly  machine  in  data  sets.  For  example,  the  data  could  originate  from 
sampled  data  from  the  input  data  channel.  The  resulting  processed  data  is  then  sent  to  the 
host  for  display  and  analysis.  The  butterfly  machine  program  is  provided  by  the  host  to 
the  butterfly  machine  via  the  host  computer  system  bus.  What  constitutes  a  program  for 
the  butterfly  machine  will  be  described  below. 
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There  is  a  traditional  tradeoff  between  specialization  of  hardware  and  the  scope  of 
functions  that  can  be  performed  by  the  architecture.  This  architecture  represents  a 
continuation  of  a  trend  toward  specialization  of  silicon  to  a  problem  domain.  Presently, 
there  are  several  chips  that  have  been  tailored  to  DSP  applications.  Notable  examples 
include  Intel's  i860,  Texas  Instrument's  TMS320C40,  and  Motorola's  96002.  The  design 
of  each  of  these  chips  reflect  tradeoffs  between  maximizing  performance  on  the  one  hand 
while  attempting  to  maximize  the  number  of  problems  that  they  can  address  effectively 
on  the  other.  An  example  of  a  highly  specialized  architecture  for  several  cyclostationary 
algorithms  may  be  found  in  Roberts  [Ref  5]. 

B.  BASIC  ARCHITECTURAL  CONCEPTS 

The  architectures  described  below  represent  an  additional  level  of  specialization, 
relative  to  the  DSP  chips  noted  above,  although  not  as  specialized  as  the  application- 
specific  architectures  described  in  Roberts  [Ref  5].  These  architectures  are  limited  to 
vector  operations  such  as  vector  multiply  or  add,  radix-r  butterflies,  and  the  dot  product 
of  two  vectors.  The  most  distinguishing  feature  of  architectures  incorporating  BFMs  is 
that  a  single  operation  type  (e.g.,  radix-2  butterfly)  is  performed  on  a  block  of  data. 
Further,  they  are  fully  pipelined  such  that  any  operation  can  be  completed  in  the  same 
number  of  cycles  as  there  are  resultants  to  be  stored  plus  latency. 

A  typical  BFM  architecture  is  shown  in  Figure  in.2.  For  each  pass,  the  BFM  is 
initialized  with  an  operation  code  (op  code)  and  data  flow  information.  Data  is  then 
streamed  through  the  BFM  from  an  input  buffer  to  an  output  buffer.  The  op  code 
specifies  the  particular  operation  to  the  performed  on  the  data.  Address  generators  (AGs) 
are  necessary  for  each  buffer  to  ensure  that  the  proper  data  is  passed  at  the  appropriate 
time.  The  AGs  receive  control  signals  from  the  controller  which  decodes  the  flow  control 
code  to  produce  these  signals.  Given  that  the  memories  can  service  references  at  the  clock 
speed  of  the  processor,  the  vectors  can  be  processed  efficiently. 

There  are  however,  two  sources  of  conflicts  that  can  diminish  the  efficiency  of 
this  highly  pipelined  architecture:  memory  conflicts  and  processor  conflicts.  The  timing 
diagram  of  Figure  in.3a  illustrates  a  vector  processor  that  flushes  the  processor  pipeline 
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prior  to  beginning  a  new  pass.  This  flushing  time  is  equal  to  the  latency  of  the  operation. 
In  Figure  in.3a,  D  cycles  are  required  to  flush  the  pipeline,  each  requiring  Tc  seconds. 

The  situation  where  a  processor  does  not  have  sufficient  resources  to  begin  a  new 
operation  without  completing  the  previous  operation  is  called  a  processor  conflict.  Figure 
in.3b  illustrates  the  performance  of  a  vector  processor  that  can  operate  without  processor 
conflicts. 


Figure  III.2  General  Vector  Machine  Architecture 
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Figure  III.3  Vector  Timing  Diagram 

Suppose  that  a  processor  contains  sufficient  resources  so  that  no  conflicts  will 
occur.  In  order  to  process  data  as  shown  in  Figure  in.3b,  the  memory  system  must 
provide  data  to  the  processor  at  the  appropriate  time.  The  situation  where  the  memory 
system  fails  to  process  memory  references  at  the  rate  required  by  the  processor  is  called  a 
memory  conflict.  Therefore,  conflict-free  operations  occur  only  when  both  the  processor 
has  sufficient  resources  to  avoid  flushing  the  pipeline  between  operations,  and  when  the 
memory  system  can  process  memory  references  as  the  processor  requests  them. 

The  effect  of  conflicts  on  performance  is  illustrated  in  Figure  in.3.  Each  pass  has 
associated  with  it  a  vector  of  length  N,  and  an  operation  with  an  associated  latency  of  D 
cycles.  Efficiency  of  the  pipeline  can  be  expressed  as  the  ratio  of  the  number  of  cycles 
required  with  no  latency  and  the  number  of  cycles  actually  required  for  a  given  operation 

N 


N  +  D 


(III.1) 


The  value  V  is  a  function  of  the  problem  domain  whereas  D  is  a  characteristic  of 
the  implementation  of  the  processor.  The  efficiency  is  clearly  related  to  the  ratio  of  N  and 
D.  A  reasonable  range  of  D  is  from  10  to  60  where  60  represents  the  latency  for  a  radix- 
16  operation.  Cyclostationary  algorithms  generally  operate  on  large  data  sets  as  large  as 
or  greater.  The  loss  in  efficiency  is  low  and  the  corresponding  simplicity  in  design  is 
significant  in  both  the  processor  and  memory  design  when  the  pipeline  is  flushed  between 
each  pass. 
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The  program  is  loaded  into  the  control  memory  through  the  input  data  channel. 
Vectors  of  data  are  then  sent  to  the  DSP  architecture  through  the  input  data  channel. 

Each  vector  is  processed  using  one  or  more  passes,  and  then  sent  to  the  output  data 
channel  through  the  output  port.  This  basic  architecture  and  the  concept  of  a  four  port 
device  in  particular,  is  borrowed  from  Array  Microsystems  [Ref  44]  and  Sharp  [Ref  45]. 

The  heart  of  a  program  for  the  BFM  is  a  list  of  passes.  A  pass  results  in  streaming 
a  set  of  data  from  one  buffer  to  another  through  the  BFM,  performing  some  operation.  A 
pass  is  defined  as  shown  in  Table  ni.l.  A  block,  the  organizational  unit  for  the  BFM 
software,  consists  of  a  list  of  passes  plus  an  input  specification  to  indicate  the  origin  of 
data  for  the  first  pass,  and  an  output  specification  that  states  where  to  send  the  resulting 
data.  Programs  are  constructed  by  stringing  blocks  together  and  through  the  use  of  super¬ 
blocks. 


pass  :=  source(s), 

destination, 

op  code 

source: 

buffer  id 

base  address 

address  sequence  type 

port  id 

destination: 

buffer  id 

base  address 

address  sequence  type 

port  id 

Table  III.l  Pass  Definition 


To  illustrate  a  simple  use  of  an  architecture  incorporating  BFMs,  consider  the 
computation  of  a  1024  point  (2'°)  FFT  with  the  architecture  illustrated  in  Figure  in.2. 
Assuming  that  radix-2  and  16  butterflies  are  available  in  the  BFM,  the  FFT  may  be 
computed  by  performing  one  radix-2  and  two  radix- 16  butterfly  passes  for  a  total  of  three 
passes  on  the  data.  The  definition  of  the  passes  for  this  example  is  contained  in 
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Table  in.2  and  illustrated  in  Figure  in.4  through  Figure  in.6.  In  Figure  111.4,  the 
operation  to  be  performed  is  a  radix-2  butterfly  beginning  with  data  in  buffer  A.  The 
input  vector  enters  the  processor  through  port  A,  is  streamed  through  the  processor,  and 
stored  in  buffer  B.  The  weighting  factors  are  supplied  from  the  coefficient  buffer  through 
port  C.  The  second  pass,  illustrated  in  Figure  in.5,  has  a  radix-16  operation  with  data 
now  in  buffer  B  and  streamed  back  to  buffer  A.  The  coefficient  buffer  serves  in  an 
analogous  role  but  for  radix- 16  weighting  factors.  A  second  radix- 16  pass  is  executed  in 
pass  three  to  complete  the  2’°  point  FFT  as  shown  in  Figure  in.6.  The  destination  buffer 
is  the  output  port  for  this  pass. 


Source(s) 


Source(s) 


Buf_A0,0,bit_rev,port_A 


Buf_Coef ,  1 024,radix2,port_C 


Destination  Buf_B,0,linear,port_B 


Buf_B0,0,const_geo,port_B 


Buf_Coef,0,radix  1 6,port_C 


Destination  Buf_A,0,linear,port_A 


Buf_A0,0,const_geo,port_A 


Buf_Coef,  1024, radix  1 6,port_C 


Destination  Buf_Out,0, linear, port_B 


Table  IIL2  Pass  Description 


Source(s) 
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Figure  IIL4  1024.Point  FFT:  Pass  1 


Figure  III.5  1024-Point  FFT:  Pass  2 
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Figure  in.6  1024-Point  FFT:  Pass  3 

A  timing  diagram  for  the  1024  point  FFT  is  shown  in  Figure  in.7.  Note  that  N  is 
the  number  of  elements  in  the  vector.  D2  and  D16  are  the  latencies  associated  with  the 
radix-2  and  16  operations  respectively.  Notice  that  it  indicates  that  input,  and  output  can 
be  overlapped  with  processing  keeping  the  processor  fully  utilized.  This  is  possible  by 
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Figure  IIL7  Timing  Characteristics  for  1024-Point  FFT 


carefully  selecting  the  ordering  of  buffers  between  which  the  data  is  passed.  The 
selection  depends  upon  whether  there  will  be  an  odd  or  even  number  of  passes. 
Additionally,  buffer  A  must  be  dual  ported  in  order  to  provide  the  overlap  of  input, 
processing,  and  output  indicated  in  Figure  in.2.  This  will  be  discussed  further  with  the 
one-chip  architecture. 


C.  PERFORMANCE  MEASURES 

There  are  several  performance  measures  that  are  appropriate  to  consider  when 
discussing  BFMs.  Factor  of  real  time,  Fj  is  defined  as  the  ratio  of  computation  time  to 

collect  time  and  represents  the  percentage  of  data  that  can  be  processed  given  that  data  is 
collected  continuously.  [Ref  48] 

_  _  Computation  Time  _  „  Phu 

^  Collect  Time  ~  Nf  ’  (  •  ) 

where 


Cu  is  the  number  of  computations  for  hardware  type  u, 
Phu  is  the  number  of  hardware  units  of  hardware  type  u, 
Tc  is  the  clock  interval, 

N  is  the  number  of  samples  taken,  and 
Ts  is  the  sample  interval. 

This  reduces  to 


CT 

Npif/ 


(IIL3) 


when  there  is  only  one  computational  hardware  resource  type  as  is  the  case  for  BFMs. 
The  computation  time  for  BFMs  is  defined  as  the  sum  of  the  product  of  the  number  of 
passes  and  the  pass  length,  for  each  type  of  operation. 

Efficiency  of  a  parallel  architecture  is  defined  as 

(in.4) 

^Usedk 
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where 


C«„.  =  ^ ,  (in.5) 

and  CReqj  is  the  number  of  processing  cycles  required  to  compute  a  function  with  a  single 
BFM  and  Cused  is  the  number  of  cycles  actually  used  by  a  /:-processor  BFM. 

D.  FAST  FOURIER  TRANSFORM 

The  set  of  fast  Fourier  transform  (FFT)  algorithms  selected  for  the  butterfly 
machine  architecture  is  those  that  are  developed  using  decimation-in-frequency.  The 
butterfly  machine  architecture  includes  radix-2,  4,  and  16  butterflies.  Butterflies  with 
radices  with  powers  of  two  have  been  selected  for  their  efficient  implementation  gained 
though  algorithmic  techniques.  Further,  the  radix-4  butterfly  has  a  straightforward 
hardware  implementation  due  to  the  fact  that  the  complex  exponential  takes  on  values  of 
±1  and  ±j,  which  allows  the  use  of  hardware  addition  in  place  of  multiplication  in  some 
cases.  Butterflies  of  radix-2  are  supported  to  allow  FFTs  of  any  vector  of  length  2*. 

Two  early  works  concerning  implementation  of  the  FFT  can  be  found  in  Singleton 
[Ref  49]  and  Pease  [Ref  50].  The  decimation-in-frequency  algorithm  discussed  below  is 
described  in  Oppenheim  [Ref  51]  for  a  radix-2  butterfly.  Figure  in.8  is  a  signal  flow 
graph  for  the  decimation-in-frequency  algorithm  for  an  eight  point  vector.  Although  this 
algorithm  can  be  implemented  for  the  butterfly  machine  architecture,  the  addressing 
reference  stream  causes  two  problems.  First,  the  radix-2  butterfly  pattern  varies  from 
pass  to  pass.  In  the  first  radix-2  pass,  the  butterfly  indices  are  separated  by  four  (e.g.,  x(0) 
and  x(4)).  In  the  second  and  third  passes,  the  indices  are  separated  by  two  and  one 
respectively.  This  variation  in  address  patterns  must  take  into  account  by  the  address 
generators  (see  Figure  III.2).  Second,  the  analysis  of  memory  performance  is  made  more 
difficult  by  the  address  pattern  changes  from  pass  to  pass. 

Both  problems  are  simplified  by  replacing  the  in-place  signal  flow  graph  with  a 
constant  geometry  signal  flow  graph.  The  corresponding  constant  geometry  signal  flow 
graph  for  Figure  III.8  is  shown  in  Figure  in.9. 
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The  following  demonstrates  decomposition  with  decimation-in-frequency  for  a 
radix-4  butterfly.  For  a  sequence 


x[n]  n  =  0,l,  ...  iV-1,  (III.6) 

the  Discrete  Fourier  Transform  (DFT)  is  defined  as 


N-\ 


X[fc]  =  £x[n]w;*  k  =  QX...N-\ 


(111.7) 


n=0 


where 


-} 


N 


:2k 

N 


(III.8) 


The  sequence  x[n]  is  partitioned  into  the  number  of  sets  equal  to  the  radix  number.  For  a 
radix-4  butterfly,  Equation  (in.7)  becomes: 


(111.9) 


A  change  of  variables  in  the  second,  third,  and  forth  summations  yields 


%-i 


XW= + 1  *[»+f ]<**'*  + 


%- 


(n+f)* 


%- 


n=0 


n=0 


n=0 


n=0 


(IILIO) 

Moving  the  parts  of  the  weighting  factors  that  are  not  dependent  on  the  summations  and 
using  Equation  (111.8)  yields 

X[k]=  £  +  W'/  . 

n=0  n=0  n=0  n=0 

(IILll) 

Consider  the  following  four  sets  of  X[A:]  such  that  k-4r,k  =  4r  +  l,k  =  4r  +  2, 
N 

and  A:  =  4r  -I-  3  for  r  =  0, 1,  •  •  —  1 .  Substituting  these  values  of  k  into  Equation  (El.  1 1 ) 

and  again  using  Equation  (III.8)  yields  the  following  four  equations  for  X[jfc]: 
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(III.12) 


%-i 

^[4^ = S  + t] + + ^] + A^ + 

«=0 

%-\ 

X[4r  +  1]=  ^[x[n]-jx[n  +  f\-x[n  +  f]+jx[n  +  ^]]w;;  W:;^  (111.13) 

n=0 

%-\ 

X[4r  +  2]=j^[x[n]-x[n  +  i]  +  x[n  +  ^]-x[n  +  ^]]W^''  (III.14) 

/i=0 

y4-i 

X[4r  +  3]=  Y,[x[n]  +  jx[n  +  i]-x[n  +  i]-jx[n  +  ^]]W^''  W".  (III.15) 

n=0 

Figure  IILIO  illustrates  the  use  of  Equations(in.l2)  through  (in.15)  to  compute  in 
part,  an  FFT  using  a  radix-4  butterfly.  The  eight  point  vector  is  passed  through  a  radix-4 
followed  by  a  radix-2  butterfly.  The  second  radix-4  butterfly  is  not  shown  for  clarity. 
The  constant-geometry  version  is  constructed  in  an  analogous  manner  as  the  radix-2 
version  shown  above. 
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E.  PERMUTATION-BASED  MEMORY  DECODING  SCHEME 

The  memory  decoding  scheme  selected  for  the  butterfly  machine  architecture  is 
permutation  based,  as  described  in  Section  D  of  Chapter  II.  This  selection  is  based  on 
finding  the  least  expensive  implementation  that  provides  excellent  throughput  for  the 
memory  system.  Conventional  memory  decoding  performs  poorly  for  algorithms  that 
have  a  characteristic  of  powers  of  two.  These  algorithms  include  radix-r  butterflies 
where  r=2^  and  the  digit-reversed  patterns  which  are  required  for  the  last  pass  of  the  FFT, 
as  described  in  Section  D  above.  The  1-Skew  and  prime  number  memory  decoding 
schemes  are  more  complex  to  implement  than  the  permutation-based  method.  Also,  most 
prime  number  decoding  schemes  do  not  use  all  of  the  physical  memory,  as  indicated  in 
Section  D  of  Chapter  H. 

The  following  discussion  of  permutation-based  memory  decoding  will  first 
describe  a  set  of  constraints  necessary  to  construct  a  memory  decoding  scheme  that  yields 
a  valid  interleaved  memory  system.  Then,  a  set  of  constraints  will  be  described  that 
yields  the  desirable  properties  for  a  memory  used  in  the  butterfly  machine  architecture.  A 
specific  permutation  matrix  will  then  be  constructed  that  is  designed  for  the  butterfly 
machine  architecture. 

First,  terminology  concerning  permutation-based  memory  decoding  will  be 
established.  The  address  space  contains  2^  words  indexed  0..  .2^  - 1.  A  binary  address 
is  written  ...a,ao  where  the  most  significant  bit  (MSB)  of  the  address  has  an 

index  of  N-\  and  the  least  significant  bit  has  an  index  of  0.  An  interleaved  memory 
system  contains  B  banks,  where  each  bank  contains  a  total  of  K  words  indexed  in  the 
conventional  manner  0  through  K-\.  Therefore,  the  number  of  bits  required  to  specify  a 
bank  number  is 

n  =  \og^{B)  (IIL16) 

and  the  number  of  bits  required  to  specify  the  index  into  a  memory  bank  is 

k^\og,{K).  (111.17) 

The  number  of  bits  for  the  memory  address  space  is  then 
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R  =  n  +  k. 


(III.18) 


The  memory  decoding  scheme  must  specify  a  bank  number  and  an  index  into  the 
bank.  This  index  will  be  referred  to  as  the  bank  index.  In  general,  the  bank  number  is 
specified  with  the  following  matrix  equation: 


b  =  P  a  (111.20) 

when  shorthand  notation  is  appropriate. 

The  resulting  vector  b  is  a  binary  representation  of  the  bank  number.  The  vector 
a  represents  the  r  LSBs  of  the  address  used  to  decode  the  bank  number.  Note  that 

n<r<R.  (IIL21) 

Entries  in  the  P  matrix  are  either  a  1  or  a  0.  A  bit  bi  of  the  bank  number  is  a  result 
of  the  normal  dot  product  of  the  ith  row  of  matrix  P  and  the  a  column  address  vector 
except  that  the  multiplications  are  logical  ANDs  and  the  summation  is  a  logical  exclusive 
OR.  The  ith  bit  of  the  bank  number  can  be  written  as 

bi  =  (Pi.O  ■  ®  {Pi.l  •  )©...©  (Pi, r-2  •  «,  )  ©  (Pi.r-l  '  «0  )  (III.22) 

revealing  that  each  bit  of  the  bank  number  can  be  thought  of  as  an  encoding  based  on  the 
parity  of  selected  bits  of  the  address  as  determined  by  the  P  matrix  entries  that  are  equal 
to  1. 

To  verify  that  permutation-based  memory  decoding  provides  a  valid  memory  map, 
it  will  first  be  shown  that  conventional  memory  decoding,  which  is  a  valid  memory  map 
by  inspection,  is  a  subset  of  permutation-based  memory  decoding.  Then,  variations  of 
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this  equivalent  conventional  memory  decoding  scheme  will  be  explored  to  determine  the 
constraints  on  the  permutation  matrix  that  are  required  to  ensure  a  valid  memory  map. 


As  an  initial  point  of  reference  for  analysis  of  permutation  memory  decoding,  note 
that  if  the  permutation  matrix  P  is  the  identity  matrix  with  dimensions  n  by  n,  then  the 
resulting  permutation-based  scheme  is  equivalent  to  conventional  memory  decoding. 
Further,  the  bank  index  is  computed  by  directly  using  the  most  significant  k=R-n  bits. 

This  decoding  scheme  for  the  bank  index  is  assumed  for  the  remainder  of  the  document. 

The  permutation  equation  for  computing  the  bank  number,  which  is  equivalent  to 
conventional  decoding,  is  shown  below  for  a  bank  number  with  n  bits. 


^n-l 


^n-2 


^1-1 


*n-2 


a, 


^0  J 


(IIL23) 


It  is  useful  to  organize  the  linear  address  space  into  equally-sized  2"  blocks,  where 
each  block  begins  at  /  •  2"  for  /  =  0 . . .  2''""  - 1 .  For  conventional  memory  decoding,  the  k 
MSBs  specify  a  memory  location  within  a  bank  and  the  n  LSBs  specify  the  bank  as 
indicated  in  Equations  (111.16)  through  (HI.  18).  Therefore,  for  conventional  decoding  for 
an  arbitrary  fixed  index,  the  sequence  0. .  .2"  - 1  on  the  n  LSBs  maps  one-to-one  and  onto 
the  set  of  bank  numbers.  This  sequence  0.  ..2"  - 1  will  be  referred  to  as  the  base 
sequence.  Since  this  one-to-one  mapping  is  valid  for  all  blocks,  the  mapping  from  the 
linear  address  space  to  the  bank  number,  bank  index  pair  space  is  also  one-to-one  and 
onto.  The  practical  implication  is  that  all  of  the  capacity  of  the  memory  hardware  is 
utilized  and  the  decoding  scheme  is  valid  for  a  memory  system. 


Now,  consider  any  change  to  the  P  matrix  specified  above  such  that  the 
dimension  of  P  is  unchanged  and  the  nonsingularity  of  the  matrix  is  maintained.  A 
modified  but  nonsingular  matrix  P  of  equal  dimension  will  yield  a  different  mapping 
(i.e.,  it  will  not  be  the  identity  mapping),  but  will  still  map  the  base  sequence  one-to-one 


57 


and  onto  for  each  block.  Therefore,  any  such  matrix  P  will  produce  a  valid  memory  that 
utilizes  all  of  the  memory.  Also,  any  such  matrix  P  generates  the  desired  round  robin 
pattern  for  a  sequential  memory  reference  stream  for  an  interleaved  memory  system  since 
each  bank  is  selected  exactly  once  for  each  base  sequence. 


Now  consider  a  matrix  P  of  dimension  nby  R  (i.e.,  use  possibly  all  of  the  address 
bits  to  generate  the  bank  number)  which  is  constructed  by  concatenating  columns  to  the 
left  of  matrix  P,  described  in  the  previous  paragraph.  All  of  the  values  in  each  of  the  new 
columns  are  assumed  to  be  0  except  for  a  single  1  in  an  arbitrary  /th  row  and  jth  column 
as  shown  in  Equation  (El.  24). 
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The  identity  matrix  is  used  for  illustration  in  Equation  (III.24)  however,  the 
comments  that  follow  apply  equally  to  any  matrix  P  where  the  identity  sub-matrix  is 
replaced  with  a  nonsingular  sub-matrix  of  dimension  n  by  n. 

Consider  the  effect  of  p-  j  =  1  on  bank  selection  for  a  base  sequence.  The  address 
bit  Uj  that  corresponds  to  p.  j  in  the  P  matrix  is  for  a  given  address,  either  a  1  or  a  0. 

When  it  is  a  0,  it  has  no  effect  on  the  bank  selection  (i.e.,  the  bank  number  is  the  same 
number  that  would  have  been  computed  if  p.  j  =  0 .  When  the  address  bit  =  1 ,  the  ith 

bank  bit,  fc,. ,  is  complemented  from  what  it  was  when  aj  =  0  (or  when  p..=0).  This 

results  in  a  permutation  of  the  bank  numbers  generated  for  a  base  sequence.  This 
permutation  can  be  illustrated  by  listing  the  original  bank  numbers  generated  when  the 
address  bit  =  0  as  a  table.  Each  row  in  the  table  is  a  bank  number  and  the  columns 

represent  bit  positions  for  the  bank  numbers.  The  permuted  set  of  bank  numbers  is  found 
by  complimenting  the  ith  column  of  the  table.  Clearly  the  new  map  is  still  one-to-one  and 
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onto  for  each  base  sequence.  Since  this  is  valid  for  all  blocks,  the  mapping  represented 
by  the  bank  decoding  scheme  indicated  in  Equation  (III.24)  is  one-to-one  and  onto. 

Recall  that  the  element  of  the  P  matrix  j  =  1  was  selected  for  an  arbitrary  i  and 
j.  Further,  once  the  mapping  resulting  from  adding  a  1  at  p.  j  is  established  to  be  a  one- 

to-one  and  onto  mapping,  the  P  matrix  can  be  modified  again  by  selecting  another 
element  p.  j  =  1 .  The  P  matrix  continues  to  provide  a  one-to-one  and  onto  mapping 

based  on  the  rational  used  for  p.  j .  Other  elements  on  the  left-hand  side  of  the  P  matrix 

can  be  modified  as  desired  while  maintaining  a  memory  decoding  scheme  that  is  one-to- 
one  and  onto.  In  summary,  it  can  be  seen  that  as  long  as  a  nonsingular  sub-matrix  of 
dimension  n  by  n  is  positioned  in  the  far  right  columns  of  the  permutation  matrix  P,  other 
elements  of  the  matrix  P  can  be  modified  in  an  arbitrary  fashion  while  maintaining  a 
valid  memory  decoding  scheme  which  also  utilizes  all  of  the  physical  memory. 

The  next  discussion  will  describe  a  set  of  constraints  that  ensure  that  a 
permutation  matrix  will  provide  maximum  throughput  for  address  patterns  with  constant 
stride  s  such  that 

^  -  2*'  (111.25) 

where  v  is  an  integer,  greater  than  or  equal  to  zero. 

As  indicated  above.  Equation  (in.23)  generates  a  round  robin  pattern  of  bank 
numbers  by  selecting  each  bank  once  within  a  base.  Also  note  that  the  base  sequence  is 
an  address  sequence  of  constant  stride  of  one.  This  is  true  for  any  matrix  P  such  that  the 
sub-matrix  has  dimension  nhy  n  and  is  nonsingular. 

Now  consider  Table  III.3.  Table  ni.3  contains  the  decimal  and  binary 
representation  of  the  counting  sequence  0. . .  15 .  It  is  easily  verified  that  the  value  of  the 
LSB  (bo  in  Table  ni.3)  does  not  change  for  a  sequence  with  constant  stride  of  two. 
Further,  if  the  bo  column  of  bits  is  removed  from  the  counting  sequence  generated  with  a 
constant  stride  of  two  and  the  resulting  columns  are  relabeled  such  that  bi  is  labeled  bi-i, 
for  each  i,  then  the  result  is  a  sequence  equivalent  to  the  original  sequence  (i.e.,  a 
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sequence  with  constant  stride  of  one).  Therefore,  the  LSB  does  not  contribute  to  the 
selection  of  a  bank  when  the  stride  is  two  and  the  resulting  sequence  is  a  sequence  with 
stride  of  one  when  ignoring  the  column  of  the  least  significant  bit.  A  similar  set  of 
statements  can  be  made  for  other  strides  of  powers  of  two.  In  general,  a  stride  s  such  that 

5  =  2*"  fc>0  an  integer  (III.26) 

will  not  engage  the  kth  LSBs  and  the  resulting  pattern  when  removing  the  k  right  most 
columns  will  yield  a  sequence  with  stride  of  one. 

Using  Equation  (in.24)  as  a  point  of  reference,  it  is  desired  to  modify  the  P 
matrix  such  that  address  patterns  with  strides  of  two  map  to  all  of  the  banks  within  a  base 
sequence.  It  is  clear  that  if  the  identity  sub-matrix  in  Equation  (in.24)  is  shifted  one 
position  to  the  left,  then  the  base  sequence  would  map  directly  into  the  bank  numbers,  as 
is  the  case  for  a  base  sequence  in  Equation  (in.23).  Further,  if  the  shifted  identity  sub¬ 
matrix  were  modified,  but  its  order  maintained  and  remained  nonsingular,  then  the  base 
sequence  would  continue  to  map  one-to-one  and  onto  the  banks.  This  would  naturally 
destroy  the  desired  pattern  for  the  constant  stride  of  one.  Therefore,  the  task  is  to  find 
modifications  to  the  matrix  P  that  will  preserve  the  desirable  properties  of  a  constant 
stride  of  one  while  enhancing  the  matrix  P  to  accommodate  address  patterns  with 
constant  stride  of  two.  In  general,  the  objective  is  to  find  a  technique  for  enhancing  a 
matrix  P  that  can  accommodate  constant  strides  up  to  a  stride  of  2',  such  that  the  matrix 
can  also  accommodate  strides  up  to  2'^’ . 

Consider  the  matrix  Equation  (in.27)  where  the  sub-matrix  P®^„  is  dimension  n 
by  n  and  nonsingular.  In  the  following  discussion,  sub-matrices  P^^^  for  various  values 
of  V  will  be  defined.  In  all  cases,  the  dimension  of  these  sub-matrices  is  n  by  n  and  the 
subscript  will  be  dropped  in  the  text  for  brevity. 
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Table  III.3  Binary  Counting  Sequence 


Based  on  the  earlier  results,  the  permutation  matrix  P  in  Equation  (111.27)  provides  a 
proper  mapping  to  bank  numbers  for  a  valid  memory.  The  sub-matrix  P®  also  provides 


for  a  one-to-one  mapping  of  the  base  sequence,  and  therefore  address  patterns,  with  a 
stride  of  one  (2®)  to  bank  numbers.  It  is  constructed  with  the  R-n-\  through  i?  - 1 


61 


columns  of  the  matrix  P.  Consider  a  new  sub-matrix,  with  dimension  n  by  n, 
beginning  in  column  R-n-2  and  ending  in  column  i?  - 2  as  shown  in  Equation 


(m.28). 
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None  of  the  values  of  matrix  P  have  changed  from  Equation  (III.27)  to  Equation  (III.28). 
The  sub-matrix  P^  is  defined  to  be  the  sub-matrix  consisting  of  columns  R-n-2 

through  /?  -  2  of  matrix  P.  Clearly  P^  is  singular  since  the  first  column  ( /?  -  n  -  2  th 
column  of  matrix  P)  is  zero.  However,  if  the  first  column  is  modified  (i.e.,  replacing  a  0 
with  a  1  in  one  or  more  rows  of  column  R-n-2)  such  that  P^  is  nonsingular,  then 

address  patterns  of  constant  stride  of  two  (2‘)  will  map  one-to-one  and  onto  the  banks. 


Consider  the  effect  of  replacing  a  0  with  a  1  in  row  i,  column  R-n-2  of  matrix 
P,  on  an  address  pattern  with  a  constant  stride  of  one.  When  the  a.  =  0 ,  the  base 

sequence  mapping  is  unchanged.  When  the  a,  s_„_2  =  1 ,  the  ith  bit  of  the  bank  number  is 
complemented,  resulting  in  a  new  mapping  but  a  mapping  that  is  still  one-to-one  and 
onto.  Since  has  one  more  bit  of  significance  than  the  sub-matrix  V^xn’ 
sequence  will  be  mapped  with  a^_„_2  =  0  followed  by  one  base  sequence  mapped  with 
<^R-n-2  =  1  •  This  pattern  will  be  repeated  for  address  patterns  greater  than  twice  the  base 
sequence  length. 

The  process  for  constructing  P*  can  be  repeated  for  P^,  P^,  ...  p^~”  such  that 
each  sub-matrix  P'  is  nonsingular.  Constmction  of  each  sub-matrix  P'  is  accomplished 
as  described  for  P^  to  support  address  patterns  of  constant  stride  of  2' . 
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The  effect  of  constructing  P*  on  address  sequences  of  constant  stride  of  2'’  where 
V  =  0. .  .i  - 1  will  now  be  described.  Construction  of  P'  involves  modification  of  column 
R-n-i-\  which  corresponds  to  the  address  bit  .  There  are  2"'^'  addresses  where 
a„+i  =  0  followed  by  2"^'  addresses  where  -  1 .  This  pattern  repeats  if  the  address 
sequence  is  longer  than  2"^'“^' .  For  address  patterns  with  constant  stride  of  one  (2°) , 

=  0  for  2'  base  sequences  (i.e.,  2'  base  sequences  are  completed  while  is 
constant).  For  address  patterns  with  constant  stride  of  two  (2' ) ,  =  0  for  2'“’  base 

sequences  followed  by  =  1  for  2'“'  base  sequences.  In  general,  for  address  patterns 
with  constant  stride  of  2'' ,  =  0  for  2'”'’  base  sequences  followed  by  a„^,.  =  1  for  2'“'' 

base  sequences. 

The  effect  of  a  set  of  ones  located  at  j  for  various  i  and  j  in  the  P  matrix  is 

cumulative.  In  general,  the  row  positions  dictate  the  bit  position(s)  of  the  bank  number  to 
be  complimented.  The  mapping  is  unique  to  the  row  number  or  set  of  row  numbers  (e.g., 
a  one  in  the  ith  row  will  generate  a  different  map  than  a  one  in  the  kth  row.  A  one  in  both 
the  ith  and  ^h  row  is  a  third  mapping).  The  column  number  determines  how  many  base 
sequences  will  be  spanned  for  a  constant  value  of  the  address  bit  ,  as  described  in  the 

previous  paragraph. 

Figure  ni.l  1  and  Figure  HI.  12  illustrate  the  address  pattern  generated,  given  the 
permutation  matrix  P  shown  below: 
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Figure  HI.  1 1  illustrates  the  mapping  generated  by  the  nonzero  bits  on  the  left-hand 
side  of  the  dashed  line  of  matrix  P  in  Equation  (in.29).  The  top  box  of  Figure  HI.  1 1 
represents  the  bank  pattern  resulting  from  the  base  sequence  given  that  all  of  the  elements 
on  the  left-hand  side  of  the  dashed  line  in  Equation  (in.29)  are  zeros.  The  effect  of 
adding  the  element  labeled  1^  is  to  first  generate  the  sequence  of  bank  numbers  generated 
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by  the  base  sequence  (this  occurs  when  the  address  bit  corresponding  to  1^  is  zero)  alone 
followed  by  a  permutation  of  the  base  sequence  labeled  mapping  #1  in  Figure  TTT 1 1 
Adding  the  element  marked  1,,  results  in  a  sequence  consisting  of  the  base  sequence 
followed  by  the  mapping  #1  sequence  (address  bit  corresponding  to  is  zero)  followed 

by  a  permutation  of  the  base  /mapping  #1  sequence  referred  to  as  mapping  #2  in  the 
figure. 

The  mapping  generated  as  a  result  of  the  two  elements  in  Equation  (III.29)  labeled 
1^  is  similar  to  that  described  above  for  la  and  lb.  First,  the  set  containing  the  base 
sequence  concatenated  with  the  mapping  #1  sequence,  concatenated  with  mapping  #2  is 
generated.  A  permuted  version  of  this  sequence  is  then  passed  and  labeled  mapping  #3. 

Figure  ni.  12  provides  a  comparative  illustration  of  the  sequences  generated  by  the 
nonzero  elements  in  Equation  (111.29).  Figure  HI.  12a)  reflects  the  base  sequence  pattern 
of  bank  numbers,  given  that  all  of  the  sub-matrix  on  the  left-hand  side  is  zero.  This 
results  in  the  repetition  of  the  block  of  base  sequence  bank  numbers  The  bank  number 
pattern  shown  in  Figure  m.  12b)  illustrates  the  effect  of  adding  1^  to  the  matrix.  Figure 
III.12c)  and  d)  reflect  the  accumulative  effect  of  adding  1^  and  1^  respectively  to  the 
bank  number  pattern. 

The  simulation  runs,  based  on  permutation-based  memory  decoding  described  in 
Chapter  n,  are  based  on  the  following  specifications  for  the  interleaved  memory  system: 

•  Number  of  Banks:  4,  8,  16  and  32. 

•  Linear  Memory  Space:  0 . . .  2^‘*  - 1 . 

•  The  permutation  matrices  used  in  the  simulations  for  all  address  patterns 
except  radix-r  butterflies  are  shown  in  Figure  111.13  through  Figure  ni.16. 
Permutation  matrices  for  radix-r  butterfly  patterns  are  described  in  Chapter 
V. 

In  this  section,  permutation  matrices  are  described  in  detail.  Requirements 
sufficient  to  ensure  a  valid  memory  map  was  described.  In  particular,  if  the  rightmost  n 
by  n  sub-matrix  of  the  permutation  matrix  is  nonsingular,  then  the  permutation  matrix 
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generates  a  valid  memory  mapping.  Further,  if  the  n  by  n  sub-matrix  identified  as  P'  is 
nonsingular,  then  address  patterns  with  stride  =  2'  will  produce  a  bank  selection  pattern 
that  is  near  ideal  for  an  interleaved  memory  system.  The  following  section  will  present 
the  high-level  architecture  for  a  single  vector  processor. 


Figure  III.ll  Permutation  Address  Pattern  Maps 
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Figure  111.12  Comparison  of  Permutation  Address  Patterns 
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Figure  III.13  Simulation  Permutation  Matrix: 
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Figure  III.14  Simulation  Permutation  Matrix:  NoBanks=8 
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Figure  III.15  Simulation  Permutation  Matrix:  NoBanks=16 
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Figure  III.16  Simulation  Permutation  Matrix:  NoBanks=32 

F.  ONE-CHIP  ARCHITECTURE 

The  one-chip  BFM  architecture  is  illustrated  in  Figure  111.17.  This  architecture 
consists  of  a  single  BFM,  six  memory  buffers,  and  data  multiplexers  to  control  data  flow. 
Not  shown  explicitly  are  address  generators,  necessary  for  each  memory.  Buffers  AO,  Al, 
BO,  and  Bl,  and  the  auxiliary  buffer  serve  as  data  sources  and  destinations.  The 
coefficient  buffer  contains  any  constants  required  by  the  function  executed  such  as 
weighting  factors  for  radix-r  butterfly  operations,  windowing  data,  frequency  down 
conversion  data,  etc. 

Control  for  accepting  data  from  the  input  data  channel,  sending  data  out  onto  the 
output  data  channel,  and  computation  by  the  BFM  are  independent.  The  basic  model  for 
communicating  data  is  message  passing.  This  will  be  described  in  more  detail  when 
discussing  the  parallel  architecture. 

The  one-chip  BFM  will  be  used  to  execute  the  SSCA  in  the  discussion  below. 

The  functional  diagram  for  the  SSCA  is  shown  in  Figure  HI.  18.  Three  basic  blocks  are 
required  for  the  SSCA,  namely  channelization,  correlation  multiply,  and  back-end  N  FFT. 
For  illustration  purposes,  it  is  assumed  that  N'  =  2^  and  A  =  2^’.  Channelization  will 
require  one  pass  for  windowing,  one  radix-2  and  one  radix- 16  butterfly  pass  for  the  32- 
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Figure  III.17  One-Chip  Architecture 


Figure  111.18  SSCA  Functional  Diagram 


point  N'  FFT,  and  one  pass  for  down  conversion.  This  channelization  block  must  be 
performed  P=4A// A'  times. 

In  order  to  maintain  overlap  of  the  input  and  output  with  the  processing,  input  will 
alternate  between  buffer  AO  and  buffer  A1 .  Assuming  that  input  data  is  in  buffer  AO, 
data  flow  for  the  channelization  passes  is  as  shown  in  Figure  111.19  (i.e.,  data  moves  from 


buffer  AO  to  the  auxiliary  buffer  and  back  to  buffer  AO,  then  to  buffer  B  and  finally  to 
auxiliary  buffer).  Note  that  if  the  number  of  passes  in  channelization  were  odd,  the  data 
path  would  be  from  buffer  AO  to  buffer  B,  and  then  to  the  auxiliary  buffer. 
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Figure  IIL19  SSCA  Execution:  Channelization 

The  second  block,  correlation  multiply,  consists  of  either  a  single  pass  of  length  L, 
the  decimation  factor,  a  total  of  P  times  or,  a  single  pass  of  length  N  a  total  of  N'  times. 
The  latter  is  the  method  of  choice  for  the  one-chip  architecture  since  it  results  in  longer 
but  fewer  passes.  The  former  is  the  better  choice  for  the  parallel  architecture  since  these 
correlation  multiplies  can  be  accomplished  incrementally  by  the  back-end  as  each  of  the 
N'  samples  are  passed  from  the  channelizer.  The  correlation  multiply  pass  is  illustrated 
in  Figure  111.20.  Observe  that  the  original  N  data  samples  are  used  from  buffer  AO. 

The  third  block,  the  back-end  A-point  FFT,  is  computed  with  one  radix-2  and  four 
radix- 16  butterfly  passes  as  shown  in  Figure  ni.21.  Data  is  ping-ponged  between  buffer 
B  and  the  auxiliary  buffer  such  that  the  final  result  is  located  in  buffer  B. 
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Figure  111.20  SSCA  Execution:  Correlation  Multiply 


Figure  III.21  SSCA  Execution:  N  FFT 


Observe  that  it  is  possible  to  chose  appropriate  paths  such  that  the  final  result  is  in  buffer 
B.  If  the  number  of  passes  were  odd,  one  pass  would  be  made  to  buffer  AO  (e.g.,  for  three 
passes,  from  buffer  B  to  buffer  AO,  to  the  auxiliary  buffer,  and  finally  back  to  buffer  B. 

A  super  block  is  used  for  each  of  these  basic  blocks  in  order  to  repeat  each  block 
the  appropriate  number  of  times. 

The  number  of  cycles  necessary  to  compute  the  SSCA,  without  taking  into 
account  latency,  using  a  BFM  is 

N'N  N'N 

^BFM  =— +8A'  +  4Wlog|^W'  +  -^log,gW.  (III.30) 

The  number  of  cycles  necessary  to  compute  the  SSCA  with  a  conventional 
processor  that  is  fully  pipelined  (i.e.,  each  addition  and  multiplication  can  be 
accomplished  in  a  single  cycle)  is 

'^NN' 

Cgpp  =  iV(A^'  +  4)  +  (12iV)log2  yV'  +  ^^log  N.  (III.31) 

Computation  of  the  total  number  of  cycles  for  BFM  and  a  fully  pipelined  general- 
purpose  processor  for  N'  varying  between  2"^  and  2^^  and  N  varying  between  2’^  and  iP- 
reveals  that  there  is  approximately  a  18  to  1  processing  gain  obtained  with  the  BFM, 
relative  to  the  fully  pipelined  general-purpose  processor.  Note  that  the  expression  for  a 
fully  pipelined  processor  is  a  theoretical  upper  bound.  The  ratio  for  an  actual  general- 
purpose  processor  would  be  much  higher.  This  factor  reflects  the  parallelism  inherent  in 
the  butterfly  processor. 

G.  PARALLEL  ARCHITECTURE 

One  board  of  the  parallel  architecture  is  shown  in  Figure  ni.22.  The  parallel 
architecture  consists  of  two  or  more  of  these  boards  with  a  common  input  data  channel, 
cross  data  channel  and  output  data  channel.  A  single  board  of  this  architecture  is  similar 
to  that  of  a  one-chip  architecture  with  the  addition  of  the  cross  data  bus  for  inter-BFM 
communications.  This  parallel  architecture  represents  a  tradeoff  between  programming 
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flexibility  and  performance.  A  BFM  architecture  that  is  optimized  for  the  SSCA  is 
described  in  Loomis  [Ref  41]. 

Each  processor  has  an  independent  clock  for  each  bus  and  processor.  Data 
communication  is  accomplished  with  a  message  passing  scheme.  A  message  consists  of  a 
control  packet  consisting  of  a  message  id,  message  type,  data  packet  length,  the  number 
of  additional  parameters,  and  the  additional  parameters.  When  a  processor  is  ready  to 
send  data  to  a  processor,  it  first  sends  the  control  packet.  If  the  message  is  accepted,  a 
ready  to  receive  signal  is  passed  back  and  the  data  transfer  begins.  The  two  types  of 
transfers  possible  on  the  bus  are  "one  to  one"  and  "one  to  many". 


Figure  111.22  Parallel  Architecture  (One  Board) 


The  relative  number  of  computations  required  for  channelization  versus  the 
correlation  multiplies  and  ATFFTs  varies  considerably  with  the  input  parameters  N'  and 
N.  The  ratio  of  the  number  of  cycles  required  for  the  back-end  (i.e.,  the  correlation 
multiply  and  the  N  FFT)  versus  channelization  is 
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(III.32) 


8W+4W[log„(W')l 

and  is  shown  in  Figure  111.23  for  various  N'  and  N.  Processors  are  allocated  statically 
based  on  this  ratio. 


Figure  111.23  Process  Allocation:  Ratio  of  Backend  to  Channelizer  Cycles 


Execution  of  SSCA  using  the  parallel  architecture  is  similar  to  that  of  the  one- 
chip  architecture  except  that  blocks  must  be  allocated  to  processors.  The  simplest  scheme 
based  on  the  data  shown  in  Figure  ni.23  is  to  dedicate  one  BFM  to  channelization  and  the 
remainder  to  the  back-end  processing.  The  efficiency  obtained  using  this  approach  is 
illustrated  for  a  ten  processor  system  in  Figure  ni.24. 
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IV.  DESCRIPTION  OF  SPLIT  TRANSACTION  MEMORY 


A.  PHYSICAL  DESCRIPTION 

Split  Transaction  Memory  (STM)  is  a  memory  architecture  that  is  designed  to 
support  a  vector  processing  architecture  with  a  throughput  that  approaches  one,  as  defined  in 
Equation  (11.4).  STM  takes  advantage  of  addressing  patterns  that  are  characteristic  of 
constant  stride.  Of  particular  importance  is  the  ability  of  STM  to  support  radix-r  and  digit 
reversal  address  patterns  where  r  is  a  power  of  two. 

STM  provides  better  throughput  than  cached  memory  because  it  takes  advantage  of 
the  predominate  characteristic  found  in  the  butterfly  machine  architecture:  patterns  of 
constant  stride  and  particularly  constant  strides  of  powers  of  two.  Although  the  memory 
reference  patterns  exhibit  some  locality  of  reference,  the  data  sets  are  frequently  too  large  to 
support  a  caching  strategy. 

STM  is  an  implementation  of  standard  interleaved  memory  that  takes  advantage  of 
more  than  one  local  buffer  in  each  bank.  To  customize  the  memory  system  to  the  target 
problem  domain,  (e.g.,  a  vector  architecture  supporting  cyclostationary  processing),  STM 
incorporates  a  memory  decoding  scheme  based  on  permutation  decoding.  In  particular,  this 
version  of  STM  is  designed  to  provide  a  throughput  that  approaches  one  for  radix-r  and  digit 
reversal  address  patterns  as  well  as  address  patterns  of  constant  stride. 

STM  is  based  on  the  premise  that  some  latency  can  be  absorbed  by  the  processor.  In 
particular,  memory  requests  can  be  made  in  advance  of  completing  the  current  instruction. 
In  general,  memory  requests  may  be  made  so  long  as  the  memory  system  has  the  capacity 
to  accept  the  request.  Memory  capacity  will  be  described  later  in  this  section. 

A  high-level  view  of  this  concept  is  shown  in  Figure  IV.  1 .  Memory  is  partitioned 
into  A:  banks.  Each  bank  consists  of  a  smart  cache  and  a  bulk  storage  module.  The  smart 
cache  contains  memory  referred  to  as  cache  elements  that  operates  at  the  same  speed  as  the 
processor.  The  BSM-CE  controller  and  CE-bus  controller  are  responsible  for  interfacing  the 
cache  elements  with  the  bulk  storage  and  the  system  data/address  buses  respectively.  The 
CE-bus  controller  drives  two  control  lines  that  are  used  for  handshaking  with  the  processor. 


One  line  is  used  to  signal  when  memory  requests  can  be  made  by  the  processor.  The  other 
line  is  used  to  indicate  when  memory  responses  are  available  to  the  processor. 


When  the  processor  makes  a  memory  access,  the  bank  which  recognizes  the  memory 
access  latches  the  request  (i.e.,  the  address,  whether  it  is  a  read  or  write  request,  and  data  if  it 
is  a  write  request)  into  a  cache  element.  A  cache  element  (CE)  is  that  set  of  data  necessary  to 
support  one  memory  access  to  the  memory  bank.  The  cache  element's  in-use  bit,  is  also  set. 
When  the  cache  element  has  been  processed  (i.e.,  data  has  been  either  written  to  or  read  from 
the  bulk  storage  for  a  write  or  read  access  respectively),  the  cache  element’s  ready  bit  is  set. 

The  components  of  a  cache  element  are  illustrated  in  Figure  IV.2.  Each  request  is 
uniquely  identified  with  an  index  which  is  used  for  synchronization  of  memory  accesses  with 
the  processor.  This  synchronization  will  be  discussed  in  the  context  of  the  request  and 
response  counters  later  in  this  section. 
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The  smart  cache  manages  cache  elements  and  requests  for  memory  read  and  write 
accesses.  There  are  three  activities  that  take  place  on  each  clock  cycle; 

•  Memory  requests  are  recognized  and  accepted  if  the  memory  system  has 
available  capacity.  The  concept  of  memory  capacity  will  be  explored  later  in 
this  section. 

•  Memory  read  responses,  ready  for  the  processor,  are  placed  on  the  data  bus  in 
a  synchronized  order. 

•  The  BSM,  when  not  busy,  is  tasked  with  the  next  pending  read  or  write 
operation. 


Request  Index  Value  obtained  from  Request  Counter. 

Address  Physical  address. 

Data  Value  to  be  read  or  written. 

In  Use  Bit  Bit  indicating  whether  cache  element  is  available  for  use. 
Read/Write  Bit  Bit  indicating  the  type  of  memory  access  type.. 

Ready  Bit  Bit  indicating  whether  memory  request  has  been  serviced. 

This  is  used  for  a  read  transaction  to  indicate  that  the 
data  has  been  retrieved  from  DRAM. 


Figure  IV.2  Cache  Element 

Memory  accesses  are  usually  initiated  within  a  bank  without  waiting  for  previous 
access  responses  to  be  completed,  either  within  the  bank  or  from  the  memory  system  in 
general.  For  a  read  access,  the  smart  cache  retrieves  the  required  data  from  the  bulk  or  main 
storage  and  stores  it  into  the  associated  cache  element  of  the  smart  cache.  A  write  request  is 
sent  by  the  processor  to  the  smart  cache.  This  data  is  later  written  to  the  bulk  storage  by  the 
BSM-CE  controller.  For  the  design  that  follows,  the  BSM-CE  controller  processes  requests 
in  the  order  they  were  requested  within  a  bank.  All  memory  accesses  are  returned  to  the 
processor  in  the  order  that  they  were  requested  (for  all  banks). 
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Coordination  of  the  STM  is  accomplished  with  two  counters  and  two  control  lines. 
The  two  counters,  request  counter  and  the  response  counter,  are  used  to  link  a  memory 
request  and  the  associated  response  for  read  accesses.  These  counters  are  shown  in  Figure 
IV.3.  Initially,  the  request  counter  and  the  response  counter  are  set  to  zero.  When  a  memory 
request  is  accepted  from  the  processor,  the  value  of  the  request  counter  is  placed  into  the 
request  index  field  of  the  cache  element  (see  Figure  IV.2)  that  is  used  to  store  the  request. 
The  request  counter  is  then  incremented  by  one. 

The  response  counter  contains  the  index  for  the  next  read  response  needed  by  the 
processor.  When  a  CE-Bus  controller  detects  that  a  read  response  is  ready  for  a  cache 
element  and  its  request  index  is  equal  to  the  current  value  of  the  response  counter,  a  memory 
response  cycle  is  performed.  The  CE-Bus  controller  associated  with  this  response,  places  the 
associated  data  contained  in  the  cache  element  onto  the  data  bus.  The  response  counter  is 
then  incremented  by  one.  The  latency  of  a  memory  access  is 

L  =  (Request  Counter  -  Response  Counter)r^„j  (IV.l) 

where  T^bus  is  the  bus  cycle  interval.  This  expression  does  not  take  into  account  for 
latency  as  a  result  of  a  stall. 

A  key  issue  of  the  STM  design  is  the  mapping  of  addresses  to  bank  numbers  and 
indices  within  a  bank.  Several  methods  were  described  in  Chapter  II  Section  D  that  can  be 
used.  Conventional  interleaving  results  in  poor  performance  for  FFT  related  memory 
reference  patterns  when  the  radix  of  the  FFT  and  the  number  of  banks  are  both  powers  of 
two.  One  solution  is  to  pick  a  stride  and  number  of  banks  that  are  relatively  prime.  Two 
strategies  described  in  Chapter  II  select  a  prime  number  of  banks.  These  solutions  either 
incur  excessive  propagation  delay  in  the  bank  selection  hardware,  or  assume  an  addressing 
pattern  that  is  not  appropriate  for  the  butterfly  machine  architecture. 

The  two  control  lines  used  in  this  design  are  the  grant  request  and  response  enable 
control  lines.  Memory  access  requests  by  the  processor  are  controlled  by  the  grant  request 
(GR)  control  line.  Each  memory  bank's  CE-Bus  Controller  enables  the  GR  as  long  as  there 
are  available  cache  elements.  All  GR  lines  from  the  banks  are  wire-ORed  to  form  a  single 
output  signal  to  the  processor  resulting  in  a  single  ready  signal  for  all  of  the  memory  banks. 
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This  is  done  to  provide  a  simpler  interface  to  the  processor,  rather  than  having  a  line  for  each 
memory  bank.  If  any  bank  does  not  have  cache  elements  available  (i.e.,  if  it  is  “full”),  the 
GR  line  becomes  inactive  and  the  processor  will  refrain  from  further  memory  accesses  until  a 
cache  element  in  the  full  memory  bank  is  freed. 


Figure  IV.3  Top  Level  Memory  System 


The  response  enable  (RE)  control  line  performs  a  similar  role  to  that  of  GR,  but  for 
memory  responses.  Each  bank's  CE-bus  controller  generates  a  RE  signal  that  is  in  turn  wire- 
ORed  to  form  a  single  control  line  to  the  processor.  The  default  for  the  RE  line  is  to  be 
disabled.  The  next  response  required  by  the  processor  (i.e.,  the  one  pointed  to  by  the 
response  counter),  can  only  be  serviced  by  one  bank.  If  the  response  is  for  a  read  and  the 
data  has  been  retrieved  from  bulk  storage,  the  RE  line  is  enabled  by  that  bank  and  the  data  is 
placed  on  the  data  bus  for  the  processor.  The  response  counter  is  then  incremented. 

One  design  of  the  STM  smart  cache  is  shown  in  Figure  IV.4.  The  data  and  address 
buses  and  the  read-write  line  enter  the  smart  cache  in  the  upper  left-hand  comer  of  Figure 
IV.4  labeled  Data,  ADDR,  and  R/W  respectively.  The  cache  elements  are  located  in  the 
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middle  of  the  figure  with  the  CE-Bus  and  BSM-CE  controllers  located  to  the  left  and  right  of 
the  cache  elements.  The  bulk  storage  module  may  be  found  in  the  far  right  of  the  figure. 

Notice  that  in  this  design,  the  request  and  response  counters  are  logically  global  (i.e., 
at  any  instant  in  time,  there  is  a  single  value  for  each  counter).  However,  the  value  of  the 
counters  are  maintained  on  each  smart  cache  as  a  hardware  counter  and  the  global  signals 
GR  and  RE  are  used  as  a  control  signal  to  increment  all  request  counters  and  response 
counters  respectively  in  a  memory  system  rather  than  a  single  request  counter  and  response 
counter  as  shown  in  Figure  IV.3. 

Before  looking  at  the  specifics  of  this  design,  it  is  useful  to  describe  the  semantics  of 
three  counters  in  the  smart  controller.  The  definitions  for  three  counters  follow: 

•  Next  Available  Counter  (NAC)  -  This  counter  is  used  as  an  index  to  the  next 
available  cache  element  to  be  used  to  store  a  new  memory  request. 

•  Currently  Processed  Counter  (CPC)  -  This  counter  points  to  the  next  cache 
element  that  has  a  memory  request  pending  for  the  bulk  storage  module  that 
has  not  been  completed. 

•  Output  Counter  (OC)  -  The  output  counter  points  to  the  cache  element  that 
will  contain  the  next  memory  read  response  in  the  bank. 

The  relationship  among  the  three  counters  is  illustrated  in  Figure  IV.5.  All  three 
counters  are  initialized  to  zero.  This  is  the  defined  condition  for  an  empty  memory  bank.  By 
empty,  it  is  meant  that  the  bank  has  no  pending  memory  requests.  The  NAC  always  points  to 
the  next  available  free  cache  element.  As  memory  requests  are  accepted  by  the  bank,  the 
NAC  is  advanced  after  each  request  accepted.  The  CPC  follows  the  NAC  and  advances 
whenever  a  memory  request  to  bulk  storage  is  completed  by  the  BSM-CE  controller.  If  the 
bank  is  kept  busy  (i.e.,  if  there  is  always  a  memory  request  to  process  by  the  BSM-CE 
controller),  this  counter  will  generally  advance  at  a  frequency  of  the  memory  ratio.  Finally, 
the  OC  advances  whenever  a  read  request  is  processed  and  passed  back  to  the  processor. 

The  OC  is  also  advanced  whenever  it  points  to  a  cache  element  containing  a  write  request. 
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Figure  IV.4  Smart  Cache  Design 


The  relationship  of  these  three  counters  can  be  thought  of  as  pointers  to  a  circular 
queue  where  the  NAC  is  the  lead  pointer  with  the  CPC  following  the  NAC,  and  the  OC 
following  the  CPC.  Several  possible  conditions  may  arise  and  can  be  illustrated  with  Figure 
IV.5.  If  all  three  counters  point  to  the  same  location,  the  memory  bank  is  empty  (i.e.,  no 
pending  memory  requests  for  the  bulk  storage  or  output  responses).  This  is  the  initial  state  of 
the  memory  bank.  If  NAC  ==CPC,  then  there  are  no  pending  memory  requests  for  the  bulk 
storage  for  the  bank.  If  CPC  ==  OC,  then  there  are  no  pending  read  responses  from  the  bank. 
The  three  counters  can  be  thought  of  as  chasing  one  another  (CPC  chasing  the  NAC,  OC 
chasing  the  CPC,  and  finally  the  NAC  chasing  the  OC.)  Observe  that  the  CPC  is  allowed  to 
catch  up  with  the  NAC  and  the  OC  is  allowed  to  catch  up  CPC.  However,  NAC  is  not 
allowed  to  catch  OC.  If  (NAC+l)==OC,  then  no  cache  elements  are  available  and  the 
memory  is  said  to  be  full.  This  definition  for  an  available  cache  element  utilizes  k-\  of  the 
cache  elements  rather  than  all  k  of  them.  This  is  done  to  simplify  the  logic  for  detecting 
when  the  memory  bank  is  empty  or  full. 
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Returning  to  Figure  IV.4,  the  CE-bus  controller  may  be  divided  into  two  principal 
components:  that  part  that  is  responsible  for  accepting  memory  requests  and  the  other  that  is 
responsible  for  coordinating  the  read  memory  response.  Accepting  memory  requests  is 
accomplished  with  the  request  counter,  the  next  available  counter,  and  the  logic  required  to 
drive  the  GR  signal.  As  indicated  above,  the  request  counter  and  response  counter  are 
initialized  to  zero.  The  request  counter  serves  as  the  input  to  the  request  index  register  for 
the  selected  cache  element.  The  NAC  is  a  modulo-^  counter  where  k  is  the  total  number  of 
cache  elements  in  a  memory  bank.  The  NAC  is  used  in  conjunction  with  the  decoder,  to 
select  the  cache  element  to  be  used  for  the  next  memory  request.  The  NAC  and  decoder  are 
enabled  with  the  GR  Internal  (GRI)  signal,  resulting  in  the  increment  of  the  NAC  and  the 
selection  and  loading  of  the  cache  element  registers.  The  request  counter  increments 
whenever  any  bank  accepts  a  memory  read  request.  The  request  counter  is  enabled  with  GR. 


Figure  IV.5  Relationship  Between  Smart  Cache  Counters 
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The  logic  within  a  bank,  driving  the  GR  line,  is  the  logical  ORing  of  two  conditions. 
This  condition  is: 

((OC  + 1)  ~=  NAC)  OR  Empty .  (IV.2) 

The  first  condition  tests  whether  there  is  an  available  cache  element  as  described 
above.  The  second  condition,  whether  the  bank  is  empty,  is  specified  with  the  empty  flag. 
The  GR  internal  (GRI)  line  is  the  logical  AND  of  the  GR  line  and  the  logic  determining 
whether  a  bank  is  selected.  The  quantity  labeled  Bank  ID  is  compared  to  n  address  lines  to 
make  this  determination.  The  least  significant  address  bits  are  assumed  unless  otherwise 
indicated  in  this  study.  The  Bank  ID  may  be  stored  in  a  register  set  by  either  with  hardware 
switches  or  software. 

The  output  response  is  implemented  with  the  response  counter,  output  counter,  and 
the  logic  required  to  drive  the  RE  and  RE  internal  (REI)  signals.  The  response  counter  is 
incremented  whenever  RE  is  active.  RE  is  the  wired  logical  ORing  of  each  bank’s  REI  line. 
For  each  bank,  REI  is  the  logical  ANDing  of  two  conditions, 

(Response  Counter  =  Index[OC])  AND  Ready  [OC].  (IV.3) 

The  first  part  of  the  condition  checks  whether  this  bank  contains  the  next  memory 
read  response  to  be  sent  to  the  processor.  The  second  condition  is  a  check  to  ensure  that  the 
data  has  been  acquired  from  the  bulk  storage.  Data[OC]  is  passed  to  the  data  bus  by  enabling 
the  tristate  buffer. 

The  BSM-CE  controller  is  responsible  for  managing  requests  to  the  bulk  storage. 

The  currently  processed  counter  is  a  modulo-k  counter  pointing  to  the  cache  element  to  be 
processed  by  the  bulk  storage.  The  CPC  selects  the  appropriate  cache  element  to  be 
processed  using  a  multiplexer.  For  read  requests,  the  resulting  data  is  written  into  the 
appropriate  cache  element  through  the  use  of  a  decoder. 

The  cache  elements  obtain  data  from  either  the  data  bus  for  memory  writes  or  the 
BSM  for  memory  reads.  The  logic  in  Figure  IV.6  illustrates  the  interaction  between  the  data 
sources,  control  lines,  and  registers  for  one  cache  element. 
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Figure  IV.6  Block  A  for  Figure  V-4 


B.  SIMULATION  MODEL 


The  STM  simulation  program  was  written  to  explore  characteristics  of  STM 
memory  designs  and  to  determine  their  effectiveness  for  a  given  memory  reference 
pattern.  The  relationship  and  interaction  between  the  STM  simulation  and  related 
computer  programs  is  shown  in  Figure  IV.7.  The  STM  simulation  program  is  referred  to 
simply  as  STM  in  the  figure.  All  programs  were  written  in  Matlab™  and  can  be  found  in 
Appendix  A..  The  following  discussion  of  the  STM  simulator  will  be  partitioned  into  the 
signal  generators,  STM  simulator,  and  the  graphics  programs. 

1.  Signal  Generators 

The  STM  simulator  accepts  three  parameters  that  define  the  memory  system  and 
the  memory  reference  stream  that  the  STM  simulator  will  be  given  to  process.  The 
memory  reference  stream  is  contained  in  a  file  referred  to  in  Figure  IV.7  as  the  address 
stream.  The  address  stream  is  a  list  of  integer  pairs,  the  first  representing  an  address  of 
the  reference  and  the  second  a  flag  indicating  whether  it  is  a  read  or  a  write  operation. 

Four  programs  (gen_const,  gen_cfft,  gen_dr,  and  gen_rand),  referred  to 
collectively  as  signal  generators  in  the  figure,  were  written  to  generate  different  classes  of 
memory  reference  streams  to  be  used  as  inputs  to  the  simulator.  The  first  three  programs, 
gen_const,  gen_cfft  and  gen_dr  were  written  to  generate  address  patterns  common  to 
digital  signal  processing  applications.  The  gen_rand  program  provides  an  address 
stream  that  yields  a  random  address  pattern. 

The  first  program  gen_const  generates  the  most  basic  address  pattern  for  vector 
processors;  patterns  of  constant  stride.  The  program  gen_const  interface  is 

ResultVect  =  gen_const(N,  Stride,  fname) 

where: 

N  -  Number  of  addresses  to  generate 

Stride  -  Stride  of  the  pattern,  and 

fname  -  Name  of  the  file  containing  the  resulting  address  patterns. 
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The  program  gen_cfft  generates  memory  reference  patterns  consistent  with 
constant  geometry  fast  Fourier  transforms  (FFT)  with  butterfly  passes  with  a  radix  of  R. 
The  program  gen_cfft  has  the  following  calling  interface; 

ResuItVect  =  gen_cfft(N,  R,  fname) 

where: 

N  is  the  number  of  addresses  to  generate, 

R  is  one  of  the  factors  of  N,  such  that  N  =  N 1  *R,  and 

fname  is  the  name  of  the  file  containing  the  resulting  address  patterns. 

For  realistic  patterns,  it  is  expected  that  R«N1 . 


Event 
Traces  & 
Performance 
Parameters 


Files 


Programs 


Figure  IV.7  STM  Simulation  Overview 


The  program  gen_dr  provides  for  generating  digit  reversal  patterns  necessary  for 
one  pass  of  an  FFT.  The  digit  reversal  pass  is  found  in  the  last  pass  of  a  FFT  for  the  class 
of  FFT  algorithms  described  in  Section  D  of  Chapter  0. 

Table  IV.  1  illustrates  the  bit  reversal  pattern  for  a  base  of  two  with  three  digits.  The 
program  gen_dr  has  the  following  calling  interface: 


ResuItVect  =  gen_dr(NoDigits,  Base,  fname) 
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where: 


NoDigitS  is  the  number  of  digits  in  the  address  pattern, 

Base  is  the  base  of  the  number  system  used,  and 

fname  is  the  name  of  the  file  to  store  the  resulting  address  patterns. 


Normal  Pattern 
(Base  10) 

Normal  Pattern 

(Base  2) 

Bit  Reversed 

Pattern  (Base  10) 

Bit  Reversed 

Pattern  (Base  2) 

0 

000 

0 

000 

1 

001 

4 

100 

2 

010 

2 

010 

3 

Oil 

6 

no 

4 

100 

1 

001 

5 

101 

5 

101 

6 

110 

3 

oil 

7 

111 

7 

111 

Table  IV.l  Digit  Reversal  for  Three  Digits  Base  2 


The  last  program,  gen_rand  generates  a  sequence  such  that  the  probability  that 
the  next  address  will  be  sequential  or  linear  is  p,  and  therefore  a  probability  of  \-p  that  the 
next  address  will  be  a  jump  to  a  random  address.  The  calling  sequence  for  gen_rand  is: 

ResuItVect  =  gen_rand(N,  p,  seed,  fname) 

where: 


N  is  the  number  of  addresses  to  generate, 
p  is  the  probability  that  the  next  instruction  is  the  next  address, 
seed  is  the  random  number  generation  seed,  and 
NoBanks  is  the  number  of  banks  in  simulation. 

This  function  is  used  to  generate  address  patterns  characteristic  of  general-purpose 
computing.  When  p>0,  there  exists  a  sequential  address  characteristic  that  simulates 
address  patterns  that  occur  when  fetching  instructions.  The  random  nature  of  the 
simulated  address  pattern  captures  data  references  as  well  as  program  branching. 
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2. 


STM  Simulator 


The  STM  Simulation  program  consists  of  the  function  stm  and  a  collection  of 
support  functions  called  by  stm.  The  program  stm  accepts  a  memory  address  stream 
described  above,  and  three  parameters  that  define  the  memory  system: 

•  number  of  memory  banks  (NoBanks), 

•  the  number  of  cache  elements  per  memory  bank  (NoCE)  and, 

•  the  ratio  between  the  dynamic  memory  cycle  time  to  the  static  memory  cycle 
time  (MemRatio).  This  parameter  specifies  the  number  of  static  memory 
cycles  required  to  complete  one  dynamic  memory  cycle. 

The  calling  sequence  for  stm  is: 

stm(Fname, ASCII, Level, AList, NoBanks, NoCE, MemRatio, MemDecode, A) 

where 

Fname  is  the  file  name  of  the  saved  data, 

ASCII  specifies  the  format  of  the  output  file  (either  ASCII  or  binary). 

Level  specifies  the  level  of  detail  of  output  saved  in  sf_name.  There  are  three 
levels  of  detail  that  can  be  saved.  Level  0  is  a  complete  dump  of  all  of  the 
memory  bank  registers  for  each  clock  cycle.  This  level  is  used  primarily  for  test 
and  validation  of  the  program.  Level  1  provides  a  tabular  listing  of  events.  Level 
2  generates  a  file  suitable  for  input  into  the  graphics  programs. 

AList  is  the  memory  address  list.  This  is  a  matrix  where  each  row  is  of  the  form: 
[Address  RWFlag] 

NoBanks  specifies  the  number  of  banks  to  be  used  in  the  simulation, 

NoCE  is  the  number  of  cache  elements  to  be  used  in  each  bank  of  the  simulation, 

MemRatio  is  the  ratio  of  dynamic  memory  cycle  time  to  static  memory  cycle 
time, 

MemDecode  is  a  flag  that  specifies  the  type  of  memory  decoding,  and 
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A  is  the  permutation  matrix  when  MemDecode=l  and  undefined  otherwise. 

The  program  stm  models  all  of  the  counters,  registers,  and  flags  for  each  bank  of 
the  memory  under  simulation.  These  variables  can  be  categorized  as  bank  counters, 
global  counters,  the  cache  element  array,  control  signals,  flags,  and  other  variables.  The 
following  three  variables  are  used  to  model  the  counters  for  each  memory  bank; 

•  Next  Available  Counter  (NAC)  This  counter  points  to  the  next  available 
cache  element  (CE)  available  for  a  memory  (read  or  write)  request. 

•  Output  Counter  (OC)  This  counter  points  to  the  next  CE  containing  a  read 
that  has  data  ready  to  be  sent  back  to  the  processor. 

•  Currently  Processed  Counter  (CPC)  This  counter  points  to  the  CE  that  is 
currently  involved  in  either  a  dynamic  read  or  write  cycle  when  PDC  == 
TRUE. 

The  program  uses  global  counters  as  shown  in  Figure  IV.3,  rather  than  replicating  them  in 
each  memory  bank  as  indicated  in  Figure  IV.4.  They  as  defined  as: 

•  Request  Counter  (ReqC)  This  counter  is  used  to  ensure  that  each  read 
request  is  matched  with  the  read  response.  ReqC  is  loaded  into  the  next 
available  CE’s  Reqindex  field  during  a  memory  request  cycle.  Note:  This 
counter  is  conceptually  global  in  that  every  memory  bank  has  access  to  the 
ReqC  contents. 

•  Response  Counter  (ResC)  The  response  counter  is  also  conceptually  a 
global  counter  and  is  used  in  conjunction  with  the  ReqC.  ResC  is  compared 
with  the  Reqindex  value  selected  by  output  counter  (OC).  If  they  are  equal, 
the  corresponding  Ready  bit  is  set.  ResC  is  incremented  when  a  read 
response  is  returned  on  the  bus. 

•  Dynamic  Memory  Cycle  Counter  (DCount)  This  counter  is  initialized  to 
ReqCount  at  the  beginning  of  a  dynamic  memory  cycle  and  decremented  for 
each  system  cycle. 
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The  Cache  Element  (CE)  of  each  bank  is  represented  by  the  cache  element  array. 
The  cache  element  array  is  the  focus  of  the  STM  design.  Each  CE  is  a  resource  for 
processing  one  read  or  write  request  and  is  the  central  interface  between  the  main  system 
bus  and  the  dynamic  memory  module.  The  cache  element  array  consists  of  the  following 
components: 

•  Request  Index  Array  (Req Index)  The  index  of  the  CE,  addressed  by  the 
next  available  counter  (NAC),  is  loaded  with  the  value  of  the  request  counter 
(ReqC)  when  a  memory  reference  is  serviced.  Note  that  this  value  serves  as 
a  unique  identifier  for  sending  data  back  to  the  requester  (e.g.,  the  processor). 

•  Address  Array  (Address)  The  Address  of  the  CE,  selected  by  the  next 
available  counter  (NAC),  is  loaded  with  the  value  of  the  address  bus  (ADDR) 
when  a  memory  reference  is  serviced. 

•  R/W  Bit  Array  (RW)  The  RW  bit  of  the  CE,  selected  by  the  next  available 
counter  (NAC),  is  loaded  with  the  value  of  the  address  bus  signal  indicating 
either  a  read  or  a  write  request  when  a  memory  reference  is  serviced. 

•  Ready  Bit  Array  (Ready)  The  Ready  bit  of  the  CE,  selected  by  the  next 
available  counter  (NAC),  is  reset,  indicating  either  data  for  a  read  request  is 
not  available  or  a  write  request  has  not  been  completed  when  a  memory 
reference  is  serviced.  This  bit  is  set  for  a  read  request  when  the  data  has  been 
loaded  from  the  dynamic  memory. 

•  Data  Array  (Data)  The  Data  Array  is  used  differently  for  read  and  write 
memory  requests.  For  a  memory  read,  the  Data  array  of  the  CE,  selected  by 
the  currently  processed  counter  (CPC),  is  loaded  at  the  end  of  a  dynamic 
write  cycle.  This  is  followed  by  the  Data  array  being  read  and  passed  to  the 
Data  Bus  (DATA),  when  referenced  by  the  output  counter  (OC)  and  the 
Request  Enable  (RE)  line  is  active.  For  a  memory  write,  the  contents  of  the 
DATA  bus  is  written  into  the  Data  array  of  the  CE,  selected  by  the  next 
available  counter  (NAC). 
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The  control  signals  for  stm  are  defined  as  follows: 

•  Grant  Request  Internal  (GRI)  When  active,  this  signal  indicates  that  a 
bank  is  ready  to  accept  requests  from  the  processor. 

•  Grant  Request  (GR)  When  active,  this  signal  indicates  that  the  memory 
system  is  ready  to  accept  requests  from  the  processor.  The  GR  signal  is 
formed  by  a  wired  AND  of  all  of  the  GRI  signals. 

•  Response  Enable  Internal  (REI)  When  active,  this  signal  indicates  that  a 
bank  is  ready  to  send  data  requested  with  a  read  request,  back  to  the 
processor. 

•  Response  Enable  (RE)  When  active,  this  signal  indicates  that  the  memory 
system  is  ready  to  send  data  requested  with  a  read  request,  back  to  the 
processor.  The  RE  signal  is  formed  by  a  wired  AND  of  all  of  the  REI 
signals. 

•  Bank  Select  (BS)  When  active,  this  signal  indicates  that  this  bank  has  been 
selected  for  a  memory  access. 

•  Start  Dynamic  Read  Cycle  (SDRC)  This  signal  indicates  the  beginning  of 
a  dynamic  read  cycle. 

•  Start  Dynamic  Write  Cycle  (SDWC)  This  signal  indicates  the  beginning  of 
a  dynamic  write  cycle. 

The  following  is  a  list  of  the  flags  defined  in  stm: 

•  Empty  This  flag  is  active  whenever  there  are  no  memory  requests  in  the 
smart  cache. 

•  Processing  Dynamic  Cycle  (PDC)  This  flag  is  active  whenever  the  dynamic 
memory  subsystem  is  processing  a  memory  request. 

ReqCount  is  a  variable  defined  in  stm  that  specifies  the  total  number  of  system 
cycles  required  to  process  a  dynamic  memory  cycle. 
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A  simplified  algorithmic  description  of  stm  is  shown  in  Figure  IV.8.  The  style 
used  to  describe  the  algorithm  is  borrowed  from  the  Matlab  language.  Formal  variables 
are  not  shown  except  where  they  provide  clarity  to  the  algorithm.  Other  details,  such  as 
file  I/O,  are  also  not  shown. 

The  first  function,  initialize(),  represents  all  one-time  initialization  required  for 
stm,  such  as  initializing  counters  and  cache  element  arrays  for  the  banks.  The  work  is 
accomplished  in  the  WHILE  loop  which  executes  until  the  variable  done  is  set  to  TRUE 
by  the  function  simulation_complete(),  which  returns  TRUE  when  there  are  no  more 
addresses  to  process  and  when  all  of  the  memory  banks  are  empty. 


stm(AList,NoBanks,NoCE,MemRatio) 
initializeO; 
done  =  FALSE; 
while  -done, 

GRI  =  evaluate_gri(); 

REI  =  evaluate_rei(); 

Empty  =  evaluate_empty(); 
generate_address(); 
for  BankNo  =  1  :NoBanks, 
memory_response(); 
service_dynamic_memory(); 
service_memory_request(); 

end; 

done  =  simulation_complete(); 
save_results(); 

System_Clock  =  System_Clock  -i- 1 ; 

end; 


Figure  IV.8  SimpUned  Algorithmic  Description  of  stm 

Each  pass  of  the  WHILE  loop  processes  one  clock  cycle.  Each  memory  bank  is 
evaluated  to  determine  the  status  of  GRI,  REI,  and  Empty  with  the  calls  to 
evaluate_gri(),  evaluate_rei(),  and  evaluate_empty().  The  function 
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generate_address()  then  conditionally  generates  an  address  if  the  memory  system  is 
able  to  accept  a  new  request.  The  conditional  nature  of  the  address  generation  is  not 
shown  explicitly  here.  Each  bank  is  then  evaluated  in  the  FOR  loop. 

There  are  up  to  three  events  that  may  occur  with  each  memory  bank.  These 
events  are  defined  as  follows; 

•  Accept  a  memory  request, 

•  Generate  a  bulk  storage  memory  cycle.  This  comes  in  three  types,  generate  a 
read  bulk  storage  memory  cycle,  generate  a  write  bulk  storage  memory  cycle, 
and  generate  a  Processing  bulk  storage  cycle. 

•  Send  a  read  response. 

These  events  are  processed  by  the  functions  memory_response(), 
service_dynamic_memory(),  and  service_memory_request()  respectively.  The 
variable  done  is  then  set,  results  for  this  cycle  are  saved  with  the  function 
save_results(),  and  the  system  clock  is  incremented. 

The  results  are  saved  in  a  file  that  can  be  processed  using  graphics  programs 
described  in  the  next  section. 

3.  Graphics  Programs 

The  graphics  programs  shown  in  Figure  IV.7  provide  graphic  plots  of  the  memory 
traces  and  compute  scalar  performance  measurements  of  the  simulation  results.  The 
primary  graphics  function  is  called  m_anal().  This  program  produces  plots  that  are  used 
to  obtain  quantitative  and  qualitative  insight  into  a  particular  simulation  run.  Its  calling 
convention  is  as  follows: 

[  TP,S,MaxL,AvgL,StdL]  = 

nn_anal(fname, ASCII, Apattern,WinLen,PlotFlag, Length, PrintFlag) 

where 

fname  is  the  name  of  the  file  containing  data  produced  by  stm  to  process. 
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ASCII  is  a  flag  indicating  whether  fname  is  stored  as  ASCII  or  binary  file.  A  0 
and  1  specifies  binary  and  ASCII  respectively. 

Apattern  is  a  short  description  of  the  Address  pattern  to  be  used  for  the  title  of  the 
graph. 

WinLen  is  the  length  of  the  smoothing  window  for  computing  instantaneous 
throughput. 

PlotFlag  is  a  flag  indicating  the  number  and  types  of  plots  to  be  produced.  Valid 
values  are  0  and  1,  indicating  no  plots  or  one  plot  respectively. 

Length  specifies  the  number  of  points  used  in  a  plot.  A  value  of  0  means  use  all 
of  the  points.  Any  value  greater  than  0  indicates  the  number  of  point  to  plot. 

PrintFlag  is  a  flag  indicating  the  types  of  output  desired.  A  value  of  0  means 
print  to  the  screen,  a  value  of  1  means  print  to  a  postscript  file,  and  a  value  of  2 
means  send  directly  to  a  printer. 

This  function  produces  the  scalar  output  statistics  of  throughput  (TP),  speedup 
(S),  maximum  latency  (MaxL),  average  latency  (AvgL),  and  the  standard  deviation 
(StdL)  for  the  simulation.  These  statistics  may  be  an  end  result  in  themselves  or  they  can 
be  used  as  input  into  the  graphics  program  p_mesh()  described  below. 

An  example  of  the  plot  produced  by  m_anal()  is  shown  in  Figure  IV.9.  The  input 
parameters  (i.e.,  a  short  description  of  the  memory  address  stream,  and  the  three 
parameters  that  define  a  STM  memory)  are  shown  above  the  top  graph.  The  plot  on  the 
top  is  the  instantaneous  latency  versus  time.  For  this  example,  the  latency  begins  at  nine 
and  becomes  16  at  steady  state. 

Scalar  performance  parameters  are  displayed  above  the  middle  plot.  These 
parameters  are  speedup,  average  throughput,  maximum  throughput,  average  throughput, 
and  the  standard  deviation  of  the  throughput.  The  middle  plot  is  a  moving  average  of  the 
throughput  based  on  a  window  of  length  WinLen.  Those  plots  shown  in  this  document 
were  constructed  with  WinLen  =  8  unless  otherwise  stated.  The  plot  on  the  bottom  is  a 
time  series  display  of  the  control  lines  grant  request  (GR)  and  request  enable  (RE).  GR  is 
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active  if  the  line  is  high  (at  the  GR  level)  and  inactive  otherwise.  RE  may  be  interpreted 
in  a  similar  manner. 

The  second  graphics  function  is  called  p_mesh().  This  function  is  responsible 
for  constructing  a  mesh  plot  of  one  type  of  scalar  performance  measurement  (e.g., 
speedup)  produced  by  m_anal.  This  type  of  plot  is  used  to  compare  a  set  of  performance 
measurements  when  two  variables  (NoCE  and  NoBanks)  are  varied  over  a  range.  In 
Figure  IV.  10,  the  performance  variable  speedup  is  plotted  for  NoCE  ranging  from  1  to  64 
and  NoBanks  ranging  from  4  to  64. 
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STM  Status 


Addr  Type:  Radix  2  #Banks=16  #CEs=16  Mem  Ratio=8 


S=7.595  AvgTP=0.9494  MaxL=16  AvgL=15.98  StdL=0.39c 


Figure  IV.9  Example  Plot  From  m_anal  Function 
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Figure  IV.IO  Example  Mesh  Plot  for  the  Performance  Parameter  Speedup 
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V.  THEORETICAL  PERFORMANCE  ANALYSIS  OF  STM 

In  this  section,  performance  of  the  STM  will  be  investigated  and  described  in 
terms  of  memory  parameters  and  characteristics  of  the  input  memory  reference  vector. 
The  conventional  bank  number  decoding  scheme  is  assumed  as  described  in  Section  D  of 
Chapter  n  for  Sections  A  through  C.  In  Section  D,  the  effects  of  permutation-based 
decoding  will  be  examined.  The  bank  selection  pattern  will  be  described  in  terms  of  the 
characteristic  of  the  input  memory  reference  vector  and  the  memory  parameters.  The 
bank  selection  patterns  will  then  be  used  to  determine  expressions  for  the  performance 
parameters,  steady-state  throughput  (TPss),  and  the  maximum  latency  (Lmax)- 

The  following  analysis  assumes  that  all  memory  references  are  read  requests. 

This  provides  for  the  worst  case  analysis  for  STM.  This  analysis  will  begin  with  the 
simplest  of  the  input  reference  streams,  those  streams  with  constant  stride.  Information 
gained  from  this  analysis  will  then  be  used  to  address  radix-r  butterfly  and  digit-reversed 
patterns. 


A.  CONSTANT  STRIDE 

The  parameters  pertinent  to  performance  measures  for  constant-stride  address 
patterns  are: 

•  Stride  length  (S), 

•  Number  of  banks  in  the  memory  system  (B), 

•  Number  of  cache  elements  per  memory  bank  (C£), 


•  Ratio  of  bulk  store  to  static  memory  cycle  time  (MR). 


The  expression  for  the  effective  number  of  banks  is  repeated  below  for 
convenience.  Given  a  stride  S,  and  a  number  of  banks  B,  the  number  of  memory  banks 
that  will  actually  be  used  can  be  expressed  as: 


B  -  ^ 

gcd(S,5)’ 


(V.1) 
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where  gcd{a,b)  is  the  greatest  common  denominator  of  a  and  b.  fiq^will  be  referred  to  as 
the  effeetive  number  of  banks.  For  example,  if  S  and  B  are  relatively  prime,  then  5^^=  B. 
Alternatively  if  5  is  a  factor  of  S,  then  B^jf  =  1 . 


For  the  set  of  Bgff  effective  banks  for  a  given  input  memory  reference  vector,  the 
constant-stride  address  pattern  distributes  the  addresses  evenly  with  a  size  of  one.  By 
“evenly”  it  is  meant  that  each  of  the  effective  banks  is  presented  addresses  in  a  round 
robin  pattern.  By  “with  a  size  of  one,”  it  is  meant  that  each  bank  is  given  one  address  at  a 
time.  Figure  V.l  illustrates  a  memory  system  with  Bgff  banks,  each  bank  with  CE  cache 
elements.  This  figure  assumes  that  all  of  the  banks  are  effective,  or  alternatively,  only  the 
effective  banks  are  shown.  The  entries  in  each  cache  element  represent  the  placement  of 
the  sequence  number  of  each  memory  address  where  the  addresses  are  distributed  evenly 
with  a  size  of  one  as  described  above.  This  is,  in  fact,  the  optimum  placement  within  the 
effective  banks  because  the  work  is  spread  evenly.  At  any  point  in  time,  the  bank  that 
will  receive  the  next  memory  request  will  be  the  bank  that  is  the  least  busy.  Additionally, 
the  use  of  input  and  output  buffers  for  standard  interleaving  and  cache  elements  for  STM, 
provide  pipelining  of  memory  requests  to  each  bank.  This  allows  each  bank  to  execute 
memory  references  with  no  wait  cycles  as  long  as  there  are  memory  references  to  process. 


CE  Index 

Addr# 

0 

1 

-  1 

• 

• 

C£-l 

CEB,ff  -  1 

Figure  V.l  Interleaved  Memory  Address  Space:  Conventional  Bank  Selection 


The  round  robin  pattern  coupled  with  the  use  of  pipelining  ensure  that  the  bulk 
storage  modules  associated  with  the  effective  banks  will  be  fully  utilized  for  the  constant- 
stride  address  pattern. 


100 


For  any  interleaved  memory  system,  the  steady-state  throughput  is  determined  by 
the  memory  ratio,  the  number  of  effective  banks,  and  the  efficiency  with  which  the 
effective  banks  are  utilized.  As  indicated  above,  there  is  full  utilization  for  the  constant 
stride  pattern.  The  following  discussion  describes  the  relationship  between  the  memory 
ratio  and  the  number  of  effective  banks. 

The  total  number  of  bulk  storage  cycles  {CgsM.r)  required  to  process  memory 
requests  is 

CBSM.r  =  N^MR  (V.2) 

The  number  of  bulk  storage  cycles  available  [Cbsm.cc)  with  a  memory  consisting  of 
B  effective  banks  M  during  N  cycles  is: 

^BSM.a  ~  ^eff  '  ^  •  (V  .3) 

The  banks  can  be  assured  to  be  used  efficiently,  for  the  reasons  described  above 
and  therefore  all  of  the  available  bulk  storage  cycles  will  be  used.  Setting  Equation  (V.2) 
equal  to  Equation  (V.3),  and  applying  the  definition  of  throughput  of  Equation  (II.6), 
yields: 

Note  that  the  range  of  clock  cycles  used  to  compute  the  steady-state  throughput  is 
assumed  to  be  in  the  steady-state  region  when  applying  Equation  (II.4).  The  banks  cannot 
process  more  memory  requests  than  are  available.  If  >  MR  then  the  maximum 

throughput  is  obtained,  therefore: 

7P„  =  1.0  for  B^^  >  MR.  (V.5) 

If  the  number  of  effective  banks  is  less  than  MR,  then  throughput  will  be 
proportional  to  the  memory  ratio  as  shown  in  Equation  (V.4). 

Under  ideal  conditions,  latency  is  expressed  as 

Lmin  =  MR  +  2  (V.6) 
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since  MR  cycles  are  required  to  process  a  memory  request  and  one  cycle  is  required  for 
the  input  and  another  for  the  output  of  the  memory  request.  If  the  number  of  banks  is 
equal  to  or  exceeds  the  memory  ratio,  the  minimum  latency  is  obtained  for  constant  stride 
addressing  patterns  because  there  is  sufficient  memory  capacity  to  process  the  memory 
requests,  and  the  memory  references  are  allocated  efficiently  to  the  banks.  Therefore, 

Lmax  =MR  +  2  when  >  MR  and  NoCE  >  2  (V.7) 

for  STM  and  standard  interleaving. 

If  the  throughput  is  not  optimum,  (i.e.,  MR  >  B^^),  then  the  steady-state  latency 

becomes  a  function  of  the  memory  ratio  and  the  number  of  effective  cache  elements. 
Throughput  not  optimum  implies  that  there  are  more  memory  requests  than  can  be 
processed  per  unit  of  time.  The  steady-state  latency  associated  with  a  constant-stride 
address  pattern  when  the  throughput  is  not  optimal  will  be  described  shortly  with  the  aid 
of  Figure  V.3. 

The  relationships  between  the  performance  measures  and  memory  parameters  for 
a  constant-stride  address  pattern  is  illustrated  in  Figure  V.2  and  Figure  V.3.  The  timing 
diagram  in  Figure  V.2  is  for  a  four  bank  memory  (labeled  BO  through  B3)  each  with  two 
cache  elements  indicated  by  the  letters  a  and  b  The  top  row  is  a  clock  for  reference 
purposes.  The  row  labeled  Bus  reflects  the  corresponding  bank  numbers  of  the  address 
stream  driven  by  the  processor.  The  superscripts  on  the  bank  numbers  are  used  to 
uniquely  identify  each  memory  reference. 

The  first  memory  reference,  0°  is  placed  on  the  bus  at  clock  cycle  0  and  accepted 
by  first  cache  element  of  bank  0  {BQa)  at  clock  cycle  1  as  indicated  by  the  entry  0°  .  The 
next  four  entries,  p\,  pi,  pi,  and  pA  indicate  the  time  required  for  the  bulk  memory  to 
process  the  memory  request.  The  next  entry  indicates  that  the  memory  response  is 
passed  back  to  the  processor. 

The  memory  ratio  is  four  (as  indicated  by  the  p\  through  pA  entries).  Therefore, 
based  on  Equation  (V.5),  the  memory  will  support  maximum  throughput  for  an  address 
stream  with  constant  stride  of  one  which  is  suggested  by  the  round  robin  bank  pattern 
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shown  on  the  bus.  The  first  memory  response  occurs  at  clock  cycle  six  with  a  latency  of 
six.  The  memory  system  responds  with  an  output  every  cycle  thereafter  yielding  a 
throughput  of  1 .0.  Note  that  in  general,  a  memory  request  is  accepted  in  a  bank  when  the 
currently  processed  memory  request  is  in  its  fourth  cycle,  thereby  queuing  up  the  new 
memory  request  just  in  time  to  keep  the  bulk  memory  continuously  busy.  By  inspection, 
it  can  be  seen  that  the  latency  is  six  for  all  memory  references.  There  is  substantial 
regularity  in  this  example  because  of  the  constant  stride  of  one  address  pattern  and 
because  there  are  sufficient  banks  to  support  a  throughput  of  1 .0. 


Figure  V.2  Timing  Diagram:  Optimal  Throughput 


Figure  V.3  again  illustrates  a  constant  stride  of  one  address  pattern  with  an 
interleaved  memory  system  with  a  memory  ratio  of  four.  In  this  instance  however,  there 
are  only  three  effective  banks  labeled  BOx  through  B2x  where  the  x  is  either  an 
a,  b,  or  a  c  indicating  the  three  cache  elements  for  each  bank. 

The  first  17  cycles  are  shown  in  Figure  V.3a.  The  first  memory  request  of  each 
bank  is  accepted  and  processed  in  the  same  manner  as  in  Figure  V.2.  However,  the 

second  memory  request  of  bank  BO  represented  by  0^  is  accepted  during  the  third 
processing  cycle  (p3)  of  the  previous  memory  request  rather  than  on  the  fourth  cycle  as  in 
Figure  V.2.  This  requires  the  second  memory  request  to  wait  one  cycle  (indicated  by  the 
w  in  cycle  five)  for  the  bulk  store  memory  to  become  available.  This  scenario  is  repeated 
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for  each  bank  (see  and  2*).  The  result  is  that  each  bank  incurs  one  wait  state.  The 

next  set  of  memory  requests  labeled  0^,  1^,  and  2^  result  in  a  second  wait  state.  In 
general,  an  additional  wait  state  is  added  after  each  additional  memory  request  is 
processed  until  a  total  of  six  wait  states  have  been  accumulated  (see  O',  T,  and  2',  for 
j=l . .  .6).  Memory  request  O’  cannot  be  accepted  on  cycle  22  since  all  cache  elements  are 
in  use.  Cache  element  b  of  bank  0  becomes  free  on  cycle  23,  freeing  cache  element  c. 

Once  the  cache  elements  become  saturated  as  described  above,  each  cache 
element  is  associated  with  a  memory  request  that  is  either  waiting  to  be  processed  or  is  in 
process.  Using  0®  as  an  example,  four  processing  cycles  are  used  to  process 
0^,  o’,  and  0*,  each  requiring  four  eycles.  Therefore,  the  maximum  latency  can  be  seen 
to  be  equal  to  the  number  of  cache  elements  plus  one,  times  the  memory  ratio  minus  the 
number  of  effective  banks.  This  relationship  is  expressed  as 

L^={NoCE  +  \)MR-B^ff  (V.8) 

and  is  applicable  to  both  STM  and  to  standard  interleaving. 

Following  the  transient,  which  ends  at  cycle  five,  each  set  of  four  cycles  (e.g.,  six 
through  nine)  contains  three  outputs  (one  from  each  bank)  and  one  cycle  with  no  output. 
This  yields  the  anticipated  0.75  throughput  as  specified  in  Equation  (V.4). 

The  next  section  will  describe  the  theoretical  memory  performance  for  radix-r 
butterfly  address  patterns. 
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Figure  V.3  Timing  Diagram:  Non-Optimal  Throughput 
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B. 


RADIX-/?  BUTTERFLY  ADDRESSING 


The  set  of  fast  Fourier  transform  algorithms  known  as  the  Cooly-Tukey 
algorithms  [Ref  52]  provide  a  technique  for  computing  FFTs  for  vectors  of  length  N 
where  N  is  defined  as, 

=  . (V.9) 

where  N,  n,  and  ki  are  all  integers.  The  set  of  algorithms  used  in  this  architecture  are 
derived  using  decimation-in  frequency  [Ref  51].  A  derivation  of  a  radix-4  butterfly 
decimation-in  frequency  algorithm  can  be  found  in  Chapter  0. 

There  are  three  types  of  address  patterns  related  to  constant  geometry  fast  Fourier 
transform  (FFT).  Figure  V.4  depicts  an  eight  point  decimation-in-frequency  FFT  using 
radix-2  butterflies  and  will  be  used  to  illustrate  the  address  patterns  related  to  the 
computation  of  FFTs  of  interest  in  this  dissertation. 

To  initialize  the  input  data  vector,  the  input  data  must  be  placed  in  sequential 
order  which  requires  a  constant  stride  pattern  of  stride  one.  The  analysis  for  this  pass  is 
described  in  the  previous  section. 

The  input  address  pattern  for  each  intermediate  pass  is  constructed  by  partitioning 
the  input  array  into  r  parts  where  r  is  the  radix  of  the  butterfly.  The  first  element  of  each 
partition  is  accessed  to  compute  the  first  butterfly.  The  second  element  of  each  partition 
is  then  accessed  for  the  second  butterfly,  etc.,  until  all  points  of  the  array  have  been  used. 
This  results  in  an  address  pattern  of  constant  stride  for  each  radix-r  butterfly.  The  stride 
is: 

N 

S  =  —  (V.IO) 

r 

where 

N  is  the  length  of  the  input  vector  (eight  in  the  example),  and 

r  is  the  radix  of  the  butterfly  (two  in  the  example). 
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These  relatively  short  sequences  of  length  r  are  concatenated  to  form  the  radix-r  memory 
reference  stream  for  a  pass.  This  memory  reference  pattern  is  the  focus  in  this  section. 

The  last  memory  address  pattern  is  digit  reversal  which  is  required  to  access  the 
results  of  the  last  pass,  as  can  be  seen  in  Figure  V.4.  Performance  analysis  of  digit 
reversal  patterns  will  be  addressed  in  the  next  section. 


Figure  V.4  Radix-2  Constant  Geometry  Decimation-in-Frequency  FFT 


The  significance  of  the  above  discussion  to  STMs  is  that  data  is  selected  with  a 
stride  S  as  defined  in  Equation  (V.  10).  For  a  STM  with  a  memory  composed  of  B  banks, 
up  to  Bsei  banks  will  be  selected  where  Bsei  can  be  expressed  as 


But,  since  only  the  first  r  elements  of  are  taken  for  the  butterfly  operation,  the 
actual  number  of  banks  selected  within  a  set  of  numbers  for  one  butterfly  can  never  be 
greater  than  r. 
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On  the  other  hand,  since  the  address  used  in  each  partition  increases  by  one  for 
each  set  of  numbers  used  in  a  butterfly  operation,  all  banks  will  be  used.  In  the  worst 
case,  B  is  a  factor  of  S,  and  Bsei  =  1 .  In  this  case,  a  single  bank  will  receive  all  r  of  the 
memory  requests  for  a  given  butterfly.  However,  the  next  set  of  r  numbers  taken  for  the 
butterfly,  will  be  sent  to  another  bank.  All  banks  will  be  given  a  task  within  B  sets  of 
butterflies  or  within  r-B  samples.  If  the  number  of  banks  and  the  radix  are  both  powers 
of  two,  this  worst  case  scenario  is  the  case. 

The  steady-state  throughput  and  maximum  latency  can  be  visualized  for  standard 
interleaving  with  the  aid  of  Figure  V.5  which  contains  a  segment  of  a  timing  diagram  for 
a  radix-4  butterfly  address  pattern  for  a  standard  interleaved  memory  with  four  banks. 
Note  only  three  banks  are  shown. 
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Figure  V.5  Timing  Diagram  for  Radix-4  Butterfly  Pattern  (Standard  Interleaving) 


First  consider  the  steady-state  throughput.  It  is  assumed  in  this  diagram  that  only 
one  bank  is  selected  for  each  butterfly  as  described  above.  In  this  case,  a  bank  B^  will 
receive  r  consecutive  memory  requests  followed  by  r  memory  requests  to  bank  Bj+i. 
When  the  last  bank  receives  its  memory  requests,  the  process  repeats  with  the  first  bank. 
From  the  figure,  it  can  be  seen  that  the  pattern  is  cyclic  and  contains  two  regions  of 
activity  as  it  relates  to  throughput.  For  a  given  bank,  the  first  and  last  memory  references 
are  processed  in  parallel  with  the  previous  and  next  banks,  respectively.  This  is  the  first 
type  of  region  referred  to  above.  One  such  region  is  located  between  cycles  six  through 
ten,  and  the  next  between  cycles  19  through  23  (Only  cycles  18  and  19  are  shown  in  the 
figure).  The  other  type  of  region  is  found  between  instances  of  the  first  type  of  region.  In 
this  second  type  of  region,  one  bank  is  processing  memory  requests  alone  (i.e.,  no  other 
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banks  have  requests  to  process).  The  representative  region  shown  in  the  figure  is  between 
cycles  1 1  though  18.  Therefore,  one  representative  period  of  this  cyclic  pattern  begins 
with  cycle  six  and  ends  with  cycle  18.  Since  the  pattern  is  cyclic,  the  throughput 
represented  by  this  period  is  the  steady-state  throughput. 

The  number  of  cycles  represented  by  the  two  regions  is  MR  + 1  and  (r  -  2)MR , 
respectively,  for  a  total  of  MR  -t- 1  +  (r  -  2)MR  cycles.  During  this  period  of  time,  two 
outputs  occur  in  the  first  region  and  r-2  outputs  occur  in  the  second  region  for  a  total  of 
r  +  2  outputs.  The  steady-state  throughput  is  the  ratio  of  the  number  of  outputs  to  the 
total  number  of  cycles  during  the  period  and  is  expressed  as 


TP 

SS 


r 

MR^\  +  {r-2)MR 


r 

{r-l)MR  +  \' 


when  B  >  MR 


(V.12) 


This  analysis  assumes  that  the  number  of  banks  is  matched  to  the  memory  ratio. 

The  maximum  latency  can  be  determined  by  inspection  of  Figure  V.5.  Consider 
one  memory  request  such  as  1^  It  is  available  initially  on  the  bus  at  cycle  six  and  must 
first  wait  for  1®  to  finish  processing.  This  results  in  a  delay  of  MR  cycles.  Processing  of 

requires  MR  more  cycles.  One  additional  cycle  is  needed  to  transfer  the  result  back  to 
the  processor  for  a  total  maximum  latency 

L^^=2MR  +  l  (V.13) 


when  B>  MR. 


Now  consider  STM  memory  architectures.  As  indicated  above,  all  banks  are 
utilized  in  radix-r  address  patterns.  Further,  a  maximum  of  r  consecutive  memory 
requests  can  be  made  to  a  bank.  Therefore,  the  banks  will  be  utilized  efficiently  if 

NoCE>r  +  l  (V.14) 

because  this  ensures  that  a  bank  will  not  stall  when  presented  with  r  consecutive  memory 
requests.  The  steady-state  throughput  is  then 
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Figure  V.6  Timing  Diagram  for  Radix-4  Butterfly  Pattern  STM(4,5,4) 


C.  DIGIT  REVERSAL 

An  address  can  be  expressed  as 

index  =  4 - h  +a^r  +  aQ 


where 


(V.17) 


r'  is  the  radix  of  the  butterfly  operation  raised  to  the  ith  power. 
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fl/  is  a  digit  of  the  base  r  number  system,  and 

index  is  the  index  into  the  data  array.  It  will  be  assumed  that  the  array  begins  at 
address  0,  making  the  index  equivalent  to  the  address.  This  is  a  valid  assumption 
since  shifting  the  array  in  the  address  space  does  not  effect  the  analysis. 

The  equivalent  digit-reversed  number  representation  is  then 

indexj,  =aor"  +  a, r""' +a2r"'^ (V.18) 


where  indexdr  is  the  digit-reversed  index. 


The  digit-reversed  address  pattern  is  a  constant  stride  pattern  with  a  stride  of  one 
that  is  digit  reversed.  The  resulting  sequence  is  one  that  increments  by  r"  as  aQ  cycles 
from  1  to  r-l.  When  Oq  =0,  ai  increments.  The  relationship  holds  for  a,  and  ai./  for 
each  i. 


Therefore,  it  can  be  seen  that  the  digit  reversal  address  pattern  is  composed  of  a 
set  of  constant  stride  sequences  of  length  r  that  are  concatenated  together.  Equation  (V.l) 
provides  insight  into  the  effectiveness  of  an  interleaved  memory  system  with 
conventional  decoding.  Within  a  sequence,  the  effective  number  of  banks  is 
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(V.19) 


If  r  and  B  are  relatively  prime,  then  B^^  =  B  for  each  sequence  and  for  the  address 

pattern  at  large.  If,  however,  both  the  number  of  banks  and  the  radix  is  a  power  of  two, 
then  the  effective  number  of  banks  is  one  for  all  practical  situations.  Therefore,  when  the 
number  of  banks  and  the  radix  are  a  power  of  two,  the  throughput  approaches  1/5. 

This  result  is  based  on  the  assumption  that  the  number  of  cache  elements  is 
relatively  small  with  respect  to  the  length  of  the  input  vector.  Suppose  that  the  number  of 
cache  elements  is  sufficiently  large  to  accept  all  memory  requests  without  a  stall.  The 
digit-reversed  address  pattern  has  the  property  that  each  bank  receives  NjB  consecutive 
memory  requests,  where  N  is  the  length  of  the  input  vector  and  B  is  the  number  of  banks. 
Because  each  bank  has  a  sufficient  number  of  cache  elements  {NjB)  to  accept  all  of  the 
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memory  requests  without  a  stall,  all  memory  requests  are  delivered  in  N  cycles.  The  last 
bank  receives  its  first  memory  request  at  cycle 


,,  N  NB-N 

N - = - . 

B  B 


(V.20) 


Assuming  that  the  number  of  banks  is  matched  to  the  memory  ratio,  the  number 
of  cycles  required  for  the  last  bank  to  process  its  memory  requests  is 

N 


B 


MR^N. 


(V.21) 


Therefore,  the  number  of  cycles  required  to  process  all  of  the  memory  requests  is 

(V.22) 


NB-N  NB-N  +  NB  2NB-N 

-  +  N  =  - 
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The  average  throughput  is  defined  as  the  ratio  of  the  number  of  cycles  needed 
with  an  ideal  memory  device  to  the  actual  number  of  cycles  required  (see  Equation  II.4) 
therefore 
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(V.23) 


when  NoCE  >  NjB.  Although  this  represents  a  substantial  improvement  from  the 
previous  result,  it  is  achieved  at  a  substantial  cost  in  hardware.  In  any  case,  it  provides  a 
throughput  of  approximately  0.5  for  even  a  modest  number  of  banks. 

The  poor  performance  of  an  interleaved  memory  system  using  conventional 
decoding  for  the  digit-reversed  case,  strongly  suggests  that  a  modification  is  required  in 
order  to  obtain  satisfactory  throughput  for  the  digit  reversal  pattern  when  that  base  of  the 
digit  is  a  power  of  two.  The  modification  selected  is  permutation-based  memory 
decoding  as  described  in  Section  E  of  Chapter  0.  The  following  discussion  describes  the 
anticipated  performance  for  constant  stride,  radix-2,  and  digit-reversed  address  patterns 
when  permutation-based  decoding  is  used. 


D.  PERMUTATION-BASED  DECODING  PERFORMANCE 


In  this  section,  the  performance  of  the  three  memory  address  patterns  described 
above  will  be  analyzed  based  on  a  bank  decoding  scheme  using  a  permutation  matrix  as 
described  in  Section  E  of  Chapter  0.  Following  the  approach  above  for  conventional 
decoding,  the  simplest  addressing  pattern,  addresses  with  constant  stride  will  be  analyzed 
first.  Results  from  this  analysis  will  then  be  applied  to  the  radix-r  butterfly  and  digit- 
reversed  addressing  patterns. 

Permutation  based  decoding  was  pursued  due  to  the  poor  performance 
encountered  when  the  number  of  banks  in  the  memory  system  and  the  characteristic  of 
the  addressing  pattern  (e.g.,  the  stride  in  constant  stride  addressing  patterns)  were  not 
relatively  prime.  The  problem  is  most  severe  for  digit-reversed  patterns  that  are 
characterized  with  sequences  with  large  constant  strides. 

As  shown  in  Chapter  0,  Section  E  ,  each  bank  is  selected  once  and  only  once 
within  a  base  sequence  when  a  non  singular  permutation  matrix  with  dimension  n  by  n  is 
used  to  decode  the  bank  number.  An  expanded  permutation  matrix  that  uses  more 
address  bits  for  bank  selection  results  in  the  base  sequence  of  bank  numbers  to  be 
permuted  as  illustrated  in  Figure  ni.  12.  All  of  the  bank  numbers  are  represented  in  each 
block  although  the  order  will  usually  vary. 

Therefore,  the  worst  case  scenario  is  that  the  last  bank  number  of  one  block  will 
be  followed  by  the  same  bank  number  of  another  block.  For  example,  for  a  four-bank 
memory,  the  first  and  second  blocks  could  be  {0  1  2  3}  and  {3  2  10}  respectively.  If 
these  were  the  only  permutations  of  the  bank  number  pattern,  the  banks  3  and  1  would 
always  be  given  two  consecutive  memory  references. 

The  following  describes  the  steady-state  throughput  and  maximum  latency  when 
permutation-based  bank  decoding  is  in  use.  In  a  standard  interleaved  memory,  the  lower 
bound  of  the  steady-state  throughput  can  be  derived  by  observing  a  cyclic  pattern  of  the 
output.  Within  a  set  of  bank  numbers,  each  bank  receives  a  request,  processes  the 
request,  and  then  places  its  response  on  the  bus  at  the  appropriate  time.  The  memory 
system  responds  with  a  total  of  MR  outputs.  Since  the  first  processing  cycle  for  the 
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second  request  of  the  last  bank  occurs  as  the  last  bank  sends  the  first  output  back  to  the 
processor,  there  will  be  MR-1  cycles  with  no  output.  Therefore,  there  are  MR  cycles  with 
output  followed  by  MR-1  cycles  with  no  outputs.  Since  the  banks  accept  and  process  the 
requests  of  the  second  set  with  no  further  delay  (after  the  second  memory  request  of  the 
last  bank  is  accepted)  then  MR  outputs  occur  following  the  MR-1  period  of  no  outputs. 

At  this  point,  two  consecutive  memory  requests  are  encountered  by  the  first  bank  and  the 
pattern  repeats.  Therefore,  for  a  standard  interleaved  memory  system,  the  worst  case 
steady-state  throughput  is 
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for  B>  MR. 


(V.24) 


Under  these  circumstances,  the  upper  bound  of  the  maximum  latency  is  incurred 
by  the  second  consecutive  memory  request  to  a  bank.  This  memory  request  must  first 
wait  for  the  preceding  memory  request  to  be  processed  {MR  cycles),  followed  by  MR 
cycles  to  process  this  memory  request,  and  finally  a  cycle  to  return  the  memory  response. 
Therefore,  the  upper  bound  for  the  maximum  latency  for  constant-stride  address  patterns 
for  standard  interleaving  is 

<  2M/?  + 1  for  5  >  MR. .  (V.25) 


Since  all  of  the  banks  are  utilized,  a  STM  with  three  or  more  cache  elements  will 
provide  sufficient  buffering  to  ensure  full  utilization  of  all  of  the  banks.  Therefore,  the 
steady-state  throughput  for  a  STM  memory  is 

TP,,=—  for  B<MR  and  NoCE>3.  (V.26) 

The  latency  for  STM  memories  is  the  same  as  for  standard  interleaving  for 
constant-stride  address  patterns,  given  that  the  number  of  banks  is  matched  to  the 
memory  ratio  and  the  number  of  cache  elements  is  three  or  more.  The  only  difference 
between  standard  interleaving  and  a  STM  memory  is  that  the  second  memory  request  is 
not  accepted  by  the  memory  in  the  standard  interleaving  case  until  the  last  processing 
cycle,  whereas  the  STM  memory  will  accept  it  when  the  request  first  appears  on  the  bus 
(i.e.,  the  memory  request  is  not  accepted  for  first  MR  cycles  in  standard  interleaving  but  is 
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accepted  by  STM)  In  either  case,  there  are  MR  cycles  required  to  process  the  first  request 
followed  by  MR  cycles  to  process  the  second  request  and  a  cycle  to  return  the  processed 
memory  response.  Therefore,  the  upper  bound  of  the  maximum  latency  for  constant 
stride  patterns  for  both  the  standard  interleaving  as  well  as  for  STM  memories  is 

<lMR+\ioxB>  MR,  and  NoCE  >  3. .  (V.27) 

Now  consider  the  use  of  permutation-based  decoding  for  a  radix-r  butterfly 
address  stream.  The  radix-r  butterfly  addressing  pattern  provides  a  unique  bank  for  each 
of  the  inputs  to  a  single  radix-r  butterfly  calculation  because  this  is  a  sequence  of  constant 
stride  (5  =  Njr)  as  long  as  the  number  of  banks  is  greater  than  the  radix-r.  If  the  number 
of  banks  is  less  than  r,  the  bank  numbers  will  repeat  and  there  exists  the  possibility  that 
two  consecutive  bank  numbers  can  occur  when  crossing  over  a  block  boundary.  This 
situation  is  similar  to  the  constant  stride  case  where  the  last  bank  of  one  base  set  can  be 
the  first  bank  in  the  next  set.  Clearly  if  the  radix  is  smaller  than  the  number  of  banks, 
then  only  a  subset  of  the  banks  will  be  selected. 

The  major  concern  for  radix-r  butterfly  address  patterns  when  using  permutation- 
based  bank  decoding  is  the  relationship  between  the  sets  of  banks  selected  for  the 
butterfly  operations.  This  address  pattern  can  be  viewed  as  an  interleaving  of  r  streams  of 
constant  stride  of  one  address  pattern.  The  effect  of  this  r-way  interleaving  is  not  clear, 
given  the  current  set  of  constraints  on  the  address  stream,  namely  that  it  consists  of  a 
sequence  of  blocks  where  each  block  contains  a  permutation  of  the  bank  numbers.  The 
larger  the  value  of  r,  the  greater  the  potential  impact  to  the  desired  properties  of  the 
address  stream. 

To  clarify  this  last  point,  note  that  for  a  radix-2  butterfly,  two  constant-stride 
address  patterns  with  a  stride  of  one  are  interleaved.  Suppose  the  number  of  banks  is 
eight.  The  following  is  an  example  of  the  problems  possible  with  radix-2  addressing: 

Sequence#!:  {1,2,  3,  4...} 

Sequence  #2  { 2,  3, 4,  5  ... } 
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with  a  resulting  sequence  of  { 1,  2,  2,  3,  3, 4, 4,  5  ...  }•  A  worse  case  scenario  occurs  if 
both  sequences  are  on  a  boundary  with  a  repeating  bank  number  resulting  in  one  bank 
number  four  consecutive  times.  As  the  radix  increases,  the  potential  for  disrupting  the 
desired  pattern  increases. 

An  alternative  to  this  dilemma  is  to  construct  a  permutation  matrix  that  has 
properties  favorable  for  radix-r  addressing  patterns.  The  following  discussion  will 
describe  one  technique  for  constructing  such  a  matrix. 

The  matrices  designed  with  the  technique  described  below  are  tailored  both  to  the 
number  of  banks  as  well  as  to  the  stride  s.  The  use  of  tailored  matrices  requires  that  the 
permutation  matrix  be  loaded  prior  to  using  the  memory.  The  permutation  matrix  cannot 
be  changed  until  the  data  inside  the  memory  is  not  required  anymore.  A  review  of  Figure 
in.9  indicates  that  a  memory  engaged  in  the  radix-r  pattern  will  also  be  required  to  accept 
a  constant  stride  pattern  with  a  stride  of  one.  Therefore,  the  constraints  necessary  to 
ensure  good  performance  for  constant-stride  address  patterns  will  also  be  applied  to  the 
matrices  designed  for  radix-r  patterns. 

The  following  description  for  constructing  permutation-based  matrices  for  radix-r 
address  patterns  will  use  a  STM  that  has  16  banks  to  illustrate  the  construction  process. 
Other  STM  configurations  can  readily  be  constructed  by  applying  the  principles  described 
below. 


Figure  V.7  illustrates  the  desired  mapping  to  the  address  space  when  the  stride  of 
the  radix-r  butterfly  is  16  by  a  permutation  matrix  to  be  described  below.  The  address 
space  is  represented  by  the  matrix  with  column  order  (i.e.,  the  first  16  elements  of  the 
address  space  are  represented  by  the  first  column.  The  contents  of  each  element  is  its 
bank  number.  For  simplicity,  the  first  16  elements  are  mapped  with  the  identity  matrix  as 
indicated  in  the  figure  and  the  permutation  matrix  shown  in  the  following  equation 
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Note  that  only  the  first  eight  rows  of  the  matrix  are  filled  in  addition  to  the  first 
column,  and  that  the  bank  numbers  are  in  binary.  By  inspection  of  Equation  (V.28),  the 
permutation  matrix  consist  of  the  identity  matrix  (four  rightmost  columns)  preceded  by 
four  colunms  that  will  now  be  discussed.  The  first  mapping,  labeled  Ml  in  Figure  V.7,  is 
a  result  of  pq  5  =  1  in  Equation  (V.28)  followed  (in  the  order  in  which  they  are  applied  to 

the  sequence)  by  pj  4  =  1 ,  p2  3  =  1 ,  and  P3  2  =  1 ,  labeled  M2  through  M4  respectively. 
The  matrix  is  zero  indexed  with  the  origin  in  the  upper  left-hand  corner. 

As  indicated  before,  this  matrix  is  designed  for  a  stride  of  16.  Assuming  that  the 
first  element  of  the  sequence  is  at  the  origin,  the  sequence  making  up  this  stride  of  16 
consists  of  a  row-wise  ordering  of  the  matrix.  Thus  far,  the  radix  of  the  butterfly  has  not 
been  specified.  Suppose  first  that  r=16.  In  this  case,  the  first  butterfly  operation  will 
receive  the  first  row  of  the  matrix  in  Figure  V.7 ;  the  second  butterfly  operation  will 
receive  the  second  row,  etc.  In  this  instance,  it  can  be  seen  that  the  effect  of  mappings 
Ml  through  M4  is  to  permute  the  first  element  in  a  row  to  all  possible  bank  numbers. 
Therefore,  since  the  matrix  will  be  accessed  in  row  order,  an  address  stream  is  generated 
that  has  the  same  properties  as  that  of  an  address  stream  with  a  constant  stride  of  16.  The 
resulting  performance  of  this  radix-r  address  stream  should  be  consistent  with  a  constant 
stride  addressing  stream  described  above. 

Suppose  now  that  the  radix  is  not  16,  but  rather  two,  four,  or  eight  (these  are  the 
radices  of  interest  in  this  effort).  For  any  other  radix,  the  addressing  pattern  remains  row 
wise.  However,  only  the  first  r  elements  of  the  matrix  are  taken  for  each  butterfly 
operation.  Assuming  that  the  first  reference  is  at  the  upper  left-hand  comer  of  the  matrix, 
the  addresses  for  the  first  butterfly  operation  is  the  first  r  elements  of  the  first  row.  The 
next  butterfly  operation  uses  the  first  r  elements  in  the  second  row,  etc. 

Consider  first  a  radix-2  butterfly  address  stream.  The  Ml  mapping  results  in  the 
selection  of  each  bank  after  eight  butterfly  operations  or  16  memory  references.  This  is 
accomplished  by  the  Ml  mapping  by  toggling  the  bank  bit  for  the  base  sequence.  This 
maps  elements  0000  through  01 1 1  to  the  second  half  of  the  0000  through  1111  sequence, 
thereby  ensuring  all  sixteen  banks  are  accessed  with  the  radix-2  pattern.  An  inspection  of 
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Figure  V.7  will  verify  that  similar  statements  hold  for  radix-4  and  8  sequences  which  also 
require  the  M2  and  M3  mappings  respectively,  to  get  the  desired  result. 

Radix-r  sequences  with  strides  longer  than  sixteen  will  take  advantage  of  the 
permutation  matrix  elements  to  the  left  of  those  used  to  implement  mappings  Ml  through 
M4  (po.2  and  pj  3  are  shown  in  Equation  (V.28)).  The  effect  of  these  mappings  is  to 

map  the  results  of  Ml  through  M4  to  other  permutations.  However,  the  relationship 
between  the  banks  is  preserved  through  these  mappings,  and  therefore  the  desired 
properties  are  preserved. 

Recall  that  one  of  the  requirements  for  these  matrices  is  that  they  meet  the 
conditions  required  for  constant  stride  matrices.  In  particular,  all  sub-matrices  of  the 
permutation  matrix  of  dimension  n  by  «,  where  n  is  the  number  of  bits  required  to 
represent  the  bank  number,  must  be  nonsingular.  In  Equation  (V.28),  this  is  clearly  not 
the  case  because  each  row  contains  a  string  of  four  or  more  zeros.  This  can  be  easily 
fixed  however  by  inserting  zeros  at  positions  po,i>  Pi,i>  P6,2’  at  p-j  2,  which  satisfies 

the  requirements  for  constant  stride  matrices  while  maintaining  the  requirements  for 
radix-r  matrices. 

Two  additional  situations  must  be  addressed:  when  the  stride  is  less  than  the 
number  of  banks  and  when  the  stride  is  greater  than  the  number  of  banks.  First,  suppose 
that  the  stride  is  less  than  the  number  of  banks.  For  example,  if  the  stride  were  a  half  of 
the  number  of  banks,  the  first  and  the  eight  elements  of  the  base  sequence  address  would 
be  accessed.  In  this  situation,  the  permutation  matrix  needs  to  transform  these  two 
elements  to  the  remaining  elements.  Using  a  similar  strategy  as  that  above,  three 
mappings,  M 1 ,  M2,  and  M3  map  the  first  half  of  the  base  sequence  into  the  first  row  and 
the  second  half  of  the  base  sequence  into  row  eight,  as  shown  in  Figure  V.8  using  the 
permutation  matrix  of  Equation  (V.29). 
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A  radix- 16  sequence  of  banks  is  the  interleaving  of  the  first  and  eighth  rows 
{0000,  1000,0100,  1100,0010,  1010, ...  01 1 1,  1 1 1 1 }  for  the  first  radix.  The  second 
radix  operation  has  a  similar  pattern  for  the  second  and  ninth  rows. 

Observe  that  a  radix-2  pattern  will  produce  the  sequence  of  bank  numbers  (0000, 
1000,  0001,  1001,  0010  ... }  and  that  the  radix-4  sequence  results  in  (0000,  0100,  1000, 

1 100, 0001,  0101,  1001, 1 101, 0010  ... }.  Observe  that  in  each  case,  each  bank  number  is 
encountered  once  every  16  memory  references. 
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Figure  V.8  Mapping  Required  When  Stride  is  One  Half  the  Number  of  Banks 

The  last  situation  is  when  the  stride  is  greater  than  the  number  of  banks.  Only  the 
first  element  of  the  base  sequence  is  referenced  (as  in  the  case  when  the  stride  was  equal 
to  the  number  of  banks).  The  required  sequence  of  mappings  Ml  through  M4  is  shown  in 
Equation  (V.28).  However,  this  mapping  must  be  shifted  to  the  left  within  the  matrix. 

For  example,  if  the  stride  is  two  times  the  number  of  banks,  the  mappings  Ml  through 
M4  must  be  shifted  one  position  to  the  left  (four  times  two  positions,  etc.).  Such  a  matrix 
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is  shown  in  Equation  (V.30)  for  a  stride  of  four  times  the  number  of  banks  when  the 
number  of  banks  is  16.  The  two  columns  between  the  identity  matrix  and  the  mappings 
provide  the  necessary  shifting  of  the  mapping  matrix. 
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In  summary,  a  radix-r  addressing  pattern  requires  tailored  matrices  to  yield 
satisfactory  performance  for  radix  values  greater  than  two.  The  matrices  must  be  tailored 
to  the  stride  of  the  radix-r  address  pattern.  In  general,  the  three  cases  that  must  be  taken 
into  account  are  when  the  stride  is  less  than,  equal  to,  or  greater  than  the  number  of 
banks.  When  these  matrices  are  used,  the  performance  of  the  radix-r  address  patterns  are 
equivalent  to  those  of  constant  stride  with  respect  to  steady-state  throughput  and 
maximum  latency.  The  next  section  will  address  digit-reversed  address  patterns  when 
permutation-based  bank  decoding  is  used. 


The  steady-state  throughput  for  a  digit-reversed  pattern  is  primarily  governed  by 
maximum  stride  which  is  equal  to  the  place  value  of  the  most  significant  digit  of  the 
address  of  the  input  vector.  This  stride  is  repeated  r  times  where  r  is  the  radix  of  the 
butterfly  used  to  compute  the  fast  Fourier  transform.  These  constant  stride  sequences  of 

length  r  and  stride  ^ ,  where  k  is  the  number  of  digits  in  the  address,  are  concatenated 
together  to  form  the  digit-reversed  pattern. 

Permutation  based  decoding  will  ensure  that  the  banks  selected  within  a  constant 
stride  sequence  will  be  unique  up  to  the  number  of  banks.  If  the  radix  is  equal  to  or 
greater  than  the  number  of  banks,  then  each  set  of  constant  stride  sequences  will  contain 
an  equal  number  of  references  to  each  bank.  This  will  yield  a  steady-state  throughput  and 
maximum  latency  consistent  with  constant  stride  addresses. 

If  the  radix  is  smaller  than  the  number  of  banks,  then  unique  banks  will  be 
selected  within  each  r  length  sequence.  However  the  relationship  between  the  bank 
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numbers  between  sequences  is  not  known.  Therefore,  the  same  banks  could  be  selected 
again. 

The  large  number  of  permuted  patterns  suggested  by  Figure  HI.  12  and  the 
permutation  matrices  in  Figure  III.  13  through  Figure  HI.  15  suggest  that  a  variety  of  banks 
can  be  selected  under  these  circumstances. 

The  lower  bound  for  the  steady-state  throughput  in  this  situation  is 

TR, for  r  <  5  and  NoCE  >  3 .  (V.31) 

The  maximum  latency  is  governed  by  the  number  of  cache  elements  and  the  memory 
ratio  under  these  circumstances.  The  maximum  latency  is 

<  {NoCE  +  l)MR  - 1 .  (V.32) 

In  this  chapter,  performance  of  both  standard  interleaving  memories  as  well  as 
STM  memories  were  analyzed  first  for  conventional  memory  decoding  and  then  for 
permutation-based  memory  decoding.  Addressing  patterns  analyzed  include  constant 
stride,  radix-r  butterfly,  and  digit-reversed  addressing  patterns. 

Constant-stride  address  patterns  provide  optimum  performance  under 
conventional  decoding  when  the  stride  and  number  of  banks  is  relatively  prime.  The 
steady-state  throughput  is  1.0  and  the  maximum  latency  is  MR +  2.  However,  the 
architecture  in  Chapter  0  requires  strides  that  au'e  powers  of  two.  Address  streams  with 
these  strides  perform  poorly  using  conventional  decoding  as  described  in  Equation  (V.4). 
Performance  for  constant  stride  patterns  that  are  not  powers  of  two  is  not  specified  based 
on  the  theory  of  the  permutation  matrices  developed. 

When  constant-stride  address  patterns  with  strides  of  powers  of  two  are  applied  to 
a  STM  memory  with  permutation-based  decoding,  the  steady-state  throughput  is  optimal 
and  the  latency  increases  to  a  upper  bound  of  2MR  + 1 .  This  is  slightly  less  than  double 
that  incurred  with  conventional  decoding. 

Radix-r  address  patterns  yield  an  optimal  steady-state  throughput  for  all  radix 
values  (r  =  2, 4, 8,  and  16)  but  with  latencies  proportional  to  the  product  of  the  radix  and 
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the  memory  ratio.  Standard  interleaving  performed  very  poorly  for  this  case  because  this 
pattern  for  the  cases  of  interest  result  in  r  consecutive  hits  to  the  same  bank. 

Little  can  be  said  regarding  performance  when  using  the  permutation-based 
matrices  for  radix-r  butterfly  patterns.  However,  when  tailored  permutation  matrices  are 
used  for  radix-r  butterfly  patterns,  optimal  throughput  with  an  upper-bound  latency 
consistent  with  constant  stride  patterns  (i.e.,  2MR  +  \ )  are  predicted. 

Conventional  decoding  performs  poorly  for  digit  reverse  address  patterns  because 
the  digit-reversed  patterns  of  interest  are  characterized  by  sequences  of  length  r  constant 
stride  with  the  stride  a  power  of  two.  The  steady-state  throughput  is  expected  to  be 
inversely  proportional  to  the  number  of  banks  when  the  number  of  cache  elements  is 
small.  If  the  number  of  cache  elements  is  large  (i.e.,  ~  NjB)  then  the  average  throughput 
is 

B 

2B-1 

where  B  is  the  number  of  banks.  The  gain  in  throughput  is  obtained  by  a  substantial 
investment  of  hardware.  Standard  interleaving  is  also  expected  to  perform  poorly  with  a 
steady-state  throughput  inversely  proportional  to  the  number  of  banks  because  this  pattern 
is  characterized  by  long  sequences  to  a  single  bank. 

The  permutation-based  theoretical  results  are  mixed  when  applied  to  digit- 
reversed  address  patterns.  When  the  radix  is  equal  to  or  greater  than  the  number  of 
banks,  then  the  projected  performance  is  consistent  with  constant-stride  address  patterns 
using  permutation-based  techniques.  When  the  radix  is  less  than  the  number  of  banks,  a 
loose  lower  bound  expression  for  the  steady-state  throughput  is  r/MR  and  the  latency  is 
an  upper-bound  expression  that  is  proportional  to  the  product  of  the  number  of  cache 
elements  and  the  memory  ratio. 

E.  RANDOM  ADDRESSING 

As  indicated  in  Equation  (II.  10),  the  speedup  of  a  standard  interleaved  memory 
system  (i.e.,  one  without  any  buffering)  yields  a  speedup  that  is  approximately  the  square 
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root  of  the  number  of  banks.  It  is  desirable  to  know  the  impact  of  buffering  on  the 
performance  of  the  interleaved  memory  system,  given  that  the  input  address  stream  is 
random. 

Queuing  theory  provides  a  framework  for  analyzing  this  problem.  For 
background  information  on  this  topic  see  Trivedi  [Ref  53]  and  Allen  [Ref  54].  The 
following  discussion  is  based  on  an  address  stream  consistent  with  Equation  (II.  10), 
developed  by  Hellerman  [Ref  25],  that  assumes  each  bank  is  equally  likely  to  be  selected 
for  each  memory  address  issued  to  a  memory  system  that  contains  MR  banks.  If  the 
problem  is  modeled  in  a  queuing  theory  context,  each  of  the  banks  can  be  modeled  as  a 
queue  with  a  single  server  (i.e.,  the  bulk  storage  unit).  This  server  has  a  service 
distribution  that  is  deterministic  with  a  constant  service  cycle  time  of  MR . 

The  input  rate  to  the  memory  system  (i.e.,  all  of  the  banks)  is  one  request  per 
cycle.  The  equal  probability  assumption  on  bank  selection  results  in  a  geometric 
distribution  for  the  interarrival  time  with  a  mean  arrival  time  of  If  MR,  or  equivalently 
1/ B  where  B  is  the  number  of  banks.  Given  a  queue  length  of  k,  a  single  bank  can  be 
described  using  queuing  theory  notation  as  a  MfD/l/k  queuing  problem  where  M* 
represents  the  distribution  of  the  interarrival  time,  D  is  the  distribution  of  the  server  time, 
the  1  indicates  a  single  server,  and  as  indicated  above,  the  k  is  the  queue  length  of  the 
input  queue  to  the  server.  At  times  it  may  be  useful  to  assume  that  k  =  °°. 

There  are  several  features  of  this  problem  that  distinguish  it  from  traditional 
queuing  problems.  The  queue  length  is  finite  and  there  is  no  balking  (i.e.,  a  customer 
does  not  leave  a  line  no  matter  how  long  the  customer  must  wait).  Further,  since  read 
cycles  are  assumed,  the  order  in  which  the  requests  are  made  must  be  preserved  across  all 
of  the  banks.  The  implication  is  that  a  given  customer  cannot  leave  the  queue  until  all  of 
the  customers  that  preceded  it  leave  their  respective  queues.  This  can  lead  to  nonsensical 
situations  when  interpreted  as  a  typical  service  line  for  humans.  For  example,  it  is 
possible  for  a  bank  to  have  a  full  queue  of  processed  customers  that  are  waiting  for  the 


‘  M  is  generally  reserved  to  represent  the  exponential  distribution,  which  is  the  continuous  counterpart  to  the  discrete- 
time  geometric  distribution. 
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proper  turn  to  be  sent  back  to  the  processor.  Therefore,  even  though  the  queue  is  full,  the 
server  has  nothing  to  process  and  cannot  accept  new  requests  until  a  serviced  customer 
exits  the  queue. 

A  closed-form  solution  was  not  obtained  for  this  queuing  problem.  However,  the 
following  observations  are  made  concerning  this  process.  Because  the  number  of  banks 
is  matched  to  the  memory  ratio,  the  banks  must  be  fully  utilized  in  order  for  the  service 
rate  to  be  equal  to  the  input  rate.  This  is  possible  only  if  the  inputs  are  assigned  in  a 
round  robin  fashion  as  described  earlier  in  this  section.  The  random  nature  of  the  input 
stream  will  certainly  not  provide  this  type  of  assignment  and  therefore  the  service  rate 
will  be  less  than  the  input  rate.  So  long  as  the  input  rate  exceeds  the  service  rate,  the 
queue  length  will  grow  and  if  the  queue  length  is  modeled  as  infinite,  there  is  no  steady- 
state  solution.  If,  however,  finite  queue  lengths  are  assumed,  then  stalls  that  occur  when 
queues  fill  up  serve  to  regulate  the  input  and  a  steady-state  condition  is  obtained. 

One  experiment  will  be  generated  to  analyze  this  problem. 
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VI.  SIMULATION  STUDIES 

A.  OVERVIEW 

The  following  section  is  a  description  of  the  split  transaction  memory  (STM) 
simulations  executed  for  the  purposes  of  this  dissertation.  The  major  emphasis  of  these 
simulation  runs  is  to  verify  the  analytic  results  obtained  in  Chapter  V  and  to  provide  data 
for  making  architectural  choices  for  the  vector  processor  architecture.  A  secondary  goal 
is  to  explore  the  use  of  STM  for  general-purpose  computing. 

The  simulation  studies  are  organized  into  two  major  groups.  The  first  group  is 
concerned  with  vector  processing.  The  second  group  consists  of  a  single  experiment 
focused  on  general-purpose  computing.  Input  variables  pertinent  to  both  groups  include 
the  architectural  parameters  number  of  bank  (NoBanks),  number  of  cache  elements 
(NoCE),  and  memory  ratio  (MemRatio).  The  memory  decoding  scheme  (Mem Decode) 
is  an  important  parameter  for  the  vector  processor  simulations  but  not  those  concerning 
general-purpose  computing.  The  type  of  address  pattern  is  the  another  key  input  to  a 
simulation  run.  The  vector  processor  simulations  are  organized  by  the  three  address 
patterns  discussed  in  Chapter  V:  constant  stride,  radix-r  butterfly,  and  digit-reversed 
address  patterns.  The  random  address  pattern  is  analyzed  for  the  general-purpose  case. 

The  primary  measurements  of  performance  that  are  analyzed  for  both  simulation 
groups  are  the  steady-state  throughput  (SSTP)  and  the  maximum  latency  (ML).  Note 
that  all  of  the  performance  parameters  described  in  Section  D  of  Chapter  n  are  measured 
during  each  simulation  and  are  included  in  the  discussion  below  when  appropriate. 
Speedup  is  also  addressed  in  the  general-purpose  computing  simulation. 

The  vector  processing  experiments  are  summarized  in  Table  VI.  1 .  They  are 
organized  into  three  pairs,  each  corresponding  to  an  address  pattern.  Each  pair  first 
addresses  conventional  memory  decoding  followed  by  permutation-based  memory 
decoding. 

The  first  set  of  experiments  deals  with  constant-stride  address  patterns.  The  first 
experiment  is  designed  to  verify  the  problems  associated  with  using  conventional 
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decoding  when  the  stride  and  the  number  of  banks  are  not  relatively  prime.  This 
experiment  also  demonstrates  optimal  STM  performance  when  the  stride  and  the  number 
of  banks  are  relatively  prime  and  thereby  places  an  upper  bound  on  the  goodness  of  STM 
performance  for  the  remaining  experiments. 


Name 

Purpose 

Scope 

Constant  Stride 

(conventional 

decoding) 

Verify  constant  stride  analysis  for 
conventional  decoding. 

Stride  =  1,  2,  3,  4,  5,  6, 

7,  8,  and  9 

Constant  Stride 
(PB  decoding) 

Evaluate  the  performance  of  STM 
using  PB  for  constant-stride  address 
patterns  where  s=2^  k=l,2,.. 

Stride  =  1, 2,  3, 4,  5,  8, 

16,  32,  64,  and  128. 

Radix-r  Butterfly 

(conventional 

decoding) 

Verify  radix-r  analysis  for 
conventional  decoding. 

r  =  2, 4,  8,  and  16. 

Radix-r  Butterfly 
(PB  decoding) 

Evaluate  the  performance  of  STM 
using  PB  for  radix-r  butterfly  address 
patterns. 

r  =  2,  4,  8,  and  16. 

Digit  Reversal 

(conventional 

decoding) 

Verify  digit-reversed  analysis  for 
power  of  two  base  number  systems 
using  conventional  decoding. 

base  /  NoDigits=  2/10, 
4/5,  8/4,  and  16/4. 

Digit  Reversal 
(PB  decoding) 

Evaluate  the  performance  of  STM 
using  PB  for  digit-reversed  address 
patterns  for  power  of  two  base  number 
systems. 

base  /  NoDigits=  2/10, 
4/5,  8/4,  and  16/4. 

Table  VI.l  Vector  Processor  Experiments 


The  set  of  parameters  used  in  the  first  experiment  is: 

NoBanks  =  4,  8,  16,  32 

MemRatio  =  NoBanks  (VI.l) 

NoCE  =  1,  2,  3, 

where  NoCE=1  is  to  be  understood  as  a  single  buffer  in  standard  interleaving  rather  than 
a  STM  with  one  cache  element.  This  convention  will  be  assumed  hereafter  for  the 
following  experiments.  The  values  for  the  number  of  banks  and  the  corresponding  values 
for  the  memory  ratio  are  also  used  in  all  of  the  other  vector  processor  experiments.  The 
number  of  banks  is  matched  to  the  memory  ratio  based  on  the  premise  that  an  optimal 
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throughput  is  obtainable  without  increasing  the  number  of  banks  to  obtain  throughput. 
The  range  of  values  for  the  number  of  cache  elements  is  tailored  for  each  experiment. 

The  standard  interleaving  case  is  always  provided  for  comparison  (NoCE  =1)  to  the  STM 
cases.  The  number  of  cache  elements  that  is  expected  to  provide  an  optimum  steady-state 
throughput  based  on  the  analysis  in  Chapter  V  (NoCE  =2  for  the  experiment  above)  is 
also  included.  Additional  values  for  the  number  of  cache  elements  may  be  provided  to 
explore  the  sensitivity  of  the  performance  values  to  the  number  of  cache  elements  (NoCE 
=  3  above). 

This  experiment  is  designed  to  validate  expressions  for  the  steady-state 
throughput  and  latency  as  expressed  in  Equations  (V.4)  through  (VI.8).  Plots  generated 
based  on  these  equations  are  shown  in  Figure  VI.  1  through  Figure  VI.8.  Figure  VI.  1  and 
Figure  VI.2  illustrate  the  steady-state  throughput  and  maximum  latency,  respectively,  for 
those  strides  that  are  relatively  prime  to  the  number  of  banks  (i.e.,  strides  1,  3,  5,  7,  and 
9).  These  figures  show  the  best  performance  possible  for  an  interleaved  memory  system. 
Figure  VI.3  and  Figure  VI.4  reflex  the  throughput  and  latency  for  a  stride  of  two.  The 
steady-state  throughput  is  0.5  for  all  values  because  the  number  of  effective  banks  is  half 
of  the  total  number  of  banks,  which  is  in  turn  equal  to  the  memory  ratio.  A  similar 
relationship  holds  for  a  stride  of  four  except  the  effective  number  of  banks  is  one  fourth 
of  the  total  number  of  banks  as  shown  in  Figure  VI.5.  The  corresponding  maximum 
latencies  for  a  stride  of  four  are  reflected  in  Figure  VI.6.  The  steady-state  throughput  is 
slightly  more  complicated  for  a  stride  of  eight  because  the  effective  number  of  banks  is  a 
fourth  of  the  total  number  of  banks  when  the  number  of  banks  is  four.  For  the  cases 
where  the  number  of  banks  is  eight  and  sixteen,  the  effective  number  of  banks  drops  to 
one  eight  of  the  total  number  as  shown  in  Figure  VI.7.  This  is  due  to  the  greatest 
common  denominator  operation  in  Equation  (V.l).  The  corresponding  maximum 
latencies  are  shown  in  Figure  VI.8. 
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Figure  VI.4  Maximum  Latency  for  Stride=2, 6  (Conventional  Decoding) 
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Figure  VI.5  Steady-State  Throughput  for  Stride=4  (Conventional  Decoding) 


Figure  VL6  Maximum  Latency  for  Stride=4  (Conventional  Decoding) 
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Figure  VI.7  Steady-State  Throughput  for  Stride=8  (Conventional  Decoding) 


Figure  VI.8  Maximum  Latency  for  Stride=8  (Conventional  Decoding) 
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The  second  experiment  is  designed  to  validate  the  effectiveness  of  permutation- 
based  techniques  when  applied  to  constant-stride  address  streams.  Further,  the  effects  on 
latency  will  be  examined  carefully  because  the  latency  analysis  only  provides  an 
expression  for  an  upper  bound.  The  parameter  values  for  the  number  of  banks  and  the 
memory  ratio  is  the  same  as  in  the  previous  experiment.  The  values  used  for  the  number 
of  cache  elements: 

NoCE  =  1 ,  3,  4.  (VL2) 

The  first  value  provides  for  the  standard  interleaving  case.  A  value  of  three  is  the 
value  required  for  optimal  steady-state  throughput.  The  value  of  four  is  added  for 
sensitivity  analysis. 

This  experiment  is  designed  to  validate  expressions  for  the  steady-state 
throughput  and  latency  as  expressed  in  Equations  (VI.24)  through  (VI.27).  Plots 
generated  based  on  these  equations  are  shown  in  Figure  VI.9  and  Figure  VI.  10.  Figure 
VI.9  illustrates  the  steady-state  throughput  of  unity  for  all  strides  that  are  a  power  of  two 
when  permutation-based  decoding  is  used.  The  corresponding  upper  bound  of  the 
maximum  latency  is  shown  in  Figure  VI.  10.  Note  that  no  theoretical  results  exist  for 
strides  that  are  not  a  power  of  two  for  permutation-based  memory  decoding.  Two  strides 
(stride=3  and  5)  are  simulated  to  provide  exemplar  performance  parameters  when  the 
stride  is  not  a  power  of  two.  Although  it  is  desirable  for  all  strides  to  yield  optimal 
performance,  those  strides  which  are  not  powers  of  two  are  not  required  for  the  vector 
architecture  described  in  Chapter  0. 
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Figure  VI.9  Steady-State  Throughput  for  Stride=2*  for  k  =  0,1,2  ...  (Permutation- 

Based  Decoding) 


Figure  VI.IO  Maximum  Latency  for  Stride=2*  for  =  0,1,2  ...  (Permutation-Based 


Decoding) 


The  third  experiment  is  designed  to  verify  the  steady-state  throughput  and 
maximum  latency  for  radix-r  butterfly  address  patterns  when  conventional  decoding  is 
used.  Expressions  for  steady-state  throughput  and  maximum  latency  are  found  in 
Equations  (VI.12),  (VI.13),  (VI.15),  and  (VL16).  The  number  of  banks  and  memory  ratio 
parameters  are  identical  to  those  used  above.  The  values  used  for  the  number  of  cache 
elements  are  adjusted  for  each  radix.  In  general,  the  number  of  cache  elements  must  be 
equal  to  r+1  where  r  is  the  radix  value.  The  values  used  for  each  radix  in  indicated  in 
Table  VI.2.  As  in  the  previous  experiments,  the  standard  interleaving  case  is  included  as 
well  as  the  value  that  the  analysis  indicates  will  provide  optimum  steady-state  throughput. 


Radix 

NoCE  Evaluated 

2 

1,3,4 

4 

1,5,6 

8 

1,9,  10 

16 

1,  17,  18 

Table  VI.2  NoCE  Evaluated  in  the  Third 
Vector  Processor  Experiment 

The  plots  for  these  theoretical  results  are  shown  in  Figure  VI.  1 1  through  Figure 
VI.15.  Figure  VI.  1 1  illustrates  the  theoretical  steady-state  throughput  for  radix-2  butterfly 
address  patterns  for  conventional  decoding.  Note  that  for  the  standard  interleaving  cases, 
the  value  represents  a  lower  bound.  The  remaining  steady-state  throughput  plots  (radix-4, 
8,  and  16)  are  not  shown  because  the  variation  in  these  plots  is  within  four  percent  of  that 
shown  in  Figure  VI.  1 1  and  that  variation  occurs  only  for  the  standard  interleaving  cases. 
The  upper  bound  for  the  maximum  latencies  are  shown  in  Figure  VI.12  through  Figure 
VI.15  for  radices  of  two,  four,  eight,  and  16  respectively.  Although  the  basic  shape  of 
these  plots  are  similar,  the  scale  is  seen  to  increase  as  the  value  of  the  radix  increases. 
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Theoretical  Maximum  Latency 


Figure  VI.15  Maximum  Latency  for  Radix=16  (Conventional  Decoding) 

The  fourth  vector  processor  experiment  is  similar  to  the  previous  experiment 
except  that  permutation-based  decoding  is  used  and  the  values  selected  for  the  number  of 
cache  elements  are  adjusted  to  yield  optimum  steady-state  throughput  for  radix-r  address 
patterns  for  tailored  permutation-based  memory  encoding.  The  values  used  for  the 
number  of  cache  elements  are 

NoCE  =  1,  2,  3,  4 

for  all  radices.  Pertinent  performance  expressions  are  found  in  Equations  (VI.24)  through 
(VI.27).  These  are  the  equations  for  constant  stride  but  are  also  appropriate  for  radix-r 
butterfly  patterns  when  the  specialized  matrices  are  used. 

The  plots  for  these  theoretical  results  are  shown  in  Figure  VI.  16  and  Figure  VI.  17. 
Figure  VI.  16  illustrates  the  theoretical  lower  bound  for  steady-state  throughput  for  radix-r 
butterfly  address  patterns  for  all  radices  for  permutation-based  decoding.  The  maximum 
latency  plot  for  all  radices  is  shown  in  Figure  VI.  17.  Observe  that  the  maximum  latency 
for  this  case  is  anticipated  to  be  substantially  lower  than  for  the  conventional  decoding  for 
the  higher  radices. 
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The  fifth  and  sixth  vector  processor  experiments  use  digit-reversed  address 
patterns.  The  fifth  experiment  uses  conventional  decoding.  Digit-reversed  address 
patterns  are  characterized  by  r-length  sequences  of  constant  stride  where  r  is  the 

radix  of  the  FFT  and  NoDigits  is  the  number  of  digits  required  to  represent  the  address  of 
the  vector.  Since  the  effective  number  of  banks  is  governed  by  Equation  (V.l)  the 
effective  number  of  banks  is  always  one  when  r  and  the  number  of  banks  are  both  a 
power  of  two.  The  following  data  set  is  used  to  validate  this  result: 

NoBanks  =  4,  8,  16,  32 
MemRatio  =  NoBanks 
NoCE  =  1 ,  3,  4. 

The  theoretical  steady-state  throughput  is  illustrated  in  Figure  VI.  18.  Observe  that 
the  steady-state  throughput  is  inverse  of  the  number  of  banks  and  this  value  is  invariant  to 
the  number  of  cache  elements. 


Figure  VI.18  Steady-State  Throughput  for  Radix=2  (Conventional  Decoding) 


The  sixth  experiment  evaluates  several  digit-reversed  patterns  using  permutation- 
based  memory  decoding.  Expressions  for  steady-state  throughput  and  maximum  latency 
are  found  in  Equations  (VI.31)  and  (VL32),  as  well  as  (VI.24)  through  (VI.27).  The 
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number  of  banks  and  memory  ratio  parameters  are  identical  to  those  used  above.  The 
values  used  for  the  number  of  cache  elements  are  shown  in  Table  VI.3.  In  general,  the 
number  of  cache  elements  used  in  this  experiment  is  the  same  as  those  used  in  the  second 
experiment.  Note  that  the  value  of  five  was  added  for  radix  eight  and  sixteen  after  one 
iteration  of  simulations.  This  will  be  discussed  further  in  the  next  section. 


Radix/NoDigits 

NoCE  Evaluated 

2/10 

1,3,4 

4/5 

1,3,4 

8/4 

1,3,  4,  5 

16/3 

1,3,  4,  5 

Table  VI.3  NoCE  Evaluated  in  the 
Sixth  Vector  Processor  Experiment 

The  theoretical  results  for  the  steady-state  throughput  and  maximum  latency  for 
the  four  cases  shown  in  Table  VI.3  are  shown  in  Figure  VI.  19  through  Figure  VI.26.  The 
steady-state  throughput  plots  are  lower  bounds  except  when  the  throughput  is  optimum. 
The  lower  bound  occurs  whenever  the  base  is  less  than  the  number  of  banks  as  described 
in  Chapter  V  Section  D. 
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Figure  VI.19  Steady-State  Throughput  for  Radix=2/NoDigits=10  (Permutation- 

Based  Decoding) 


Upper  Bound  for  Maximum  Latency 


Figure  VI.20  Maximum  Latency  for  Radix=2/NoDigits=10  (Permutation-Based 

Decoding) 


Figure  VI.21  Steady-State  Throughput  for  Radix=4/NoDigits=5  (Permutation 

Based  Decoding) 


Figure  VI.23  Steady-State  Throughput  for  Radix=8/NoDigits=4  (Permutation 

Based  Decoding) 


Figure  VI.24  Maximum  Latency  for  Radix=8/NoDigits=4  (Permutation-Based 

Decoding) 
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Figure  VL25  Steady-State  Throughput  for  Radix=16/NoDigits=3  (Permutation 

Based  Decoding) 


The  second  group  of  experiments  pertain  to  general-purpose  computing  and  are 
summarized  in  Table  VI.4.  This  experiment  examines  the  marginal  effectiveness  of 
adding  additional  memory  banks  and  cache  elements.  The  number  of  memory  banks  is 
matched  to  the  memory  ratio  such  that  if  the  memory  banks  are  used  optimally,  then  the 
speedup  obtained  from  the  memory  system  is  equal  to  the  number  of  banks.  The  address 
stream  is  completely  random  to  allow  comparison  with  results  in  the  literature  [Ref  55]. 
The  parameters  used  for  this  experiment  are 

NoBanks  =  1,  4,  8,  16,  32 

MemRatio  =  NoBanks 

(VI.3) 

NoCE  =  1,  2,  4,  8,  16,  32,  64 

p  =  0. 


Name 

Purpose 

Scope 

Speedup  Analysis 

Investigate  the  affect  to  speedup 
when  varying  STM  parameters. 

Set  p=0  for  historical  comparison. 

MemRatio  ==  NoBanks 
for  all  cases. 

Table  VL4  General-Purpose  Computer  Experiment 

The  next  two  sections  contains  the  results  of  each  of  the  simulation  runs  described 
above  for  vectoring  processing  and  general-purpose  computing  respectively. 


B.  VECTOR  PROCESSING  EXPERIMENTS 

1.  Constant  Stride:  Conventional  Memory  Decoding 

A  comparison  of  the  theoretical  and  simulated  results  for  the  first  vector  processor 
experiment  are  shown  in  Figure  VI.27  through  Figure  VI.39.  The  plots  for  stride  of  one 
are  shown  in  Figure  VI.27  and  Figure  VI.28.  For  both  of  these  performance  measures, 
the  theoretical  and  simulated  results  are  identical. 

Examples  of  two  simulation  runs,  the  first  with  four  banks  and  two  cache 
elements  and  the  second  with  32  banks  and  two  cache  elements  are  shown  in  Figure 
VI.29  and  Figure  VI.30  respectively.  In  each  plot,  the  grant  request  line  (GR)  indicating 
memory  requests  are  accepted  by  the  memory,  is  active  on  the  first  cycle  and  remains 
active  until  all  memory  responses  are  accepted.  The  response  enable  (RE)  line  becomes 
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active  indicating  that  output  is  available  for  the  processor,  and  remains  on  until  the  last 
response  is  sent  to  the  processor.  In  each  case,  the  RE  line  follows  the  GR  line  after  the 
necessary  latency  of  six  and  34,  respectively  (i.e.,  MR+2).  This  is  the  best  performance 
that  can  be  obtained  from  the  memory  systems.  One  of  the  tradeoffs  of  using  a  larger 
number  of  memory  banks  is  the  latency,  as  illustrated  in  Figure  VI.29  and  Figure  VI.30. 
This  latency  results  in  a  average  throughput  of  0.9624  and  0.795  for  four  versus  32  banks 
respectively.  Figure  VI.3 1  illustrates  the  effect  on  average  throughput  when  varying  the 
number  of  banks  for  the  case  of  a  stride  of  one.  The  penalty  of  a  larger  number  of  banks 
is  clearly  shown  when  the  vectors  are  relatively  small  (128  points). 
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STM  Status  Throughput 


PlotlD:s1cn  #  Banks=4  #  CEs=2  Mem  Ratio=4 


Figure  VI.29  Detailed  Simulation  Run  for  Stride=l  STM(4,2,4) 
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Figure  VI.30  Detailed  Simulation  Run  for  Stride=l  STM(32,2,32) 
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Figure  VI.31  Average  Simulated  Throughput  for  Stride=l  (Conventional 

Decoding) 

A  comparison  of  the  theoretical  versus  simulated  steady-state  throughput  and 
maximum  latency  is  shown  in  Figure  VI.32  and  Figure  VI.33,  respectively,  for  a  stride  of 
two.  Notice  that  the  simulated  steady-state  throughput  varies  by  as  much  as  four  percent 
from  the  theoretical  for  thirty  two  banks.  The  steady-state  throughput  is  calculated  by 
taking  the  average  of  the  last  twenty  five  percent  of  the  throughput  values.  This 
occasionally  results  in  a  bias  error  when  the  steady-state  value  of  the  throughput  is  not 
constant.  Such  a  steady-state  is  illustrated  in  Figure  VI.34  for  thirty  two  banks  and  three 
cache  elements. 

The  simulated  maximum  latency  is  in  agreement  with  the  theoretical  plot  for 
stride  of  two  as  shown  in  Figure  VI.33.  The  average  throughput,  shown  in  Figure  VI.35, 
indicates  a  consistent  pattern  with  a  stride  of  one.  The  average  throughput  obtained  with 
four  banks  (approximately  0.495)  is  within  one  percent  of  the  steady-state  ceiling  of  0.5. 
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Figure  VI.33  Comparison  of  Theoretical  Versus  Simulated  Maximum  Latency  for 

Stride=2  (Conventional  Decoding) 


Figure  VI.35  Average  Simulated  Throughput  for  Stride=2  (Conventional 

Decoding) 

A  comparison  of  the  theoretical  and  simulated  steady-state  throughput  and 
maximum  latency  for  stride  of  four  is  shown  in  Figure  VI.36  and  Figure  VI.37 
respectively.  The  results  are  similar  to  that  for  stride=2.  The  simulated  steady-state 
throughput  varies  by  less  than  two  percent  from  the  theoretical  results  and  the  simulated 
and  theoretical  maximum  latencies  are  identical. 

A  comparison  of  the  theoretical  and  simulated  steady-state  throughput  and 
maximum  latency  for  stride=8  is  shown  in  Figure  VI.38  and  Figure  VI.39,  respectively. 
The  results  are  also  similar  to  those  above.  However,  the  simulated  steady-state 
throughput  as  well  as  the  maximum  latency  is  identical  to  the  theoretical  results. 
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In  conclusion,  the  first  experiment  verifies  the  theoretical  expressions  for  steady- 
state  throughput  and  maximum  latency  for  constant-stride  address  patterns.  These 
address  patterns  illustrate  optimal  results  from  interleaved  memory  systems  as  well  as 
substantial  performance  degradation  when  the  stride  is  not  relatively  prime  to  the  number 
of  banks.  STM  memories  and  standard  interleaved  memories  generally  have  equivalent 
steady-state  throughput  performance  for  constant-stride  address  patterns  when  compared 
to  standard  interleaving.  Further,  when  the  stride  is  not  relatively  prime,  STM  memories 
incur  more  latency  than  standard  interleaving. 

For  the  architecture  presented  in  Chapter  0  for  FFT  computation,  those  strides  that 
are  not  relatively  prime  are  the  strides  required  rather  than  those  that  are  relatively  prime. 
The  following  experiment  is  used  to  validate  performance  when  strides  are  not  relatively 
prime  when  permutation-based  decoding  is  used. 

2.  Constant  Stride:  Permutation-Based  Memory  Decoding 

An  analysis  of  the  second  experiment  will  be  divided  into  strides  that  are  a  power 
of  two  (e.g.,  one,  two,  four,. . .)  and  a  selected  set  of  strides  not  a  power  of  two  (e.g.,  three 
and  five).  It  is  important  to  recall  that  the  estimates  for  the  maximum  latency  for 
permutation-based  memories  are  always  upper-bound  estimates.  The  theoretical  steady- 
state  throughput  results  for  the  radix-r  butterfly  and  digit-reversed  address  patterns  are 
also  lower  bounds.  However,  for  address  patterns  with  constant  stride,  the  theoretical 
results  are  exact  for  the  STM  cases  (i.e.,  when  the  number  of  cache  elements  is  greater 
than  one). 

A  comparison  of  the  theoretical  and  simulated  results  for  strides  of  one,  two,  and 
sixty  four  are  shown  in  Figure  VI.40  through  Figure  VI.45.  This  selection  of  strides  is 
presented  as  a  representative  of  all  of  the  strides  of  powers  of  two  possible,  given  the 
permutation  matrices  used  in  the  simulation  and  documented  in  Figure  HI.  13  through 
Figure  111.16. 

The  most  striking  characteristic  of  the  plots  contained  in  Figure  VI.40  through 
Figure  VI.45  is  that  they  are  for  all  practical  purposes  the  same.  Plots  for  stride  of  one  are 
shown  in  Figure  VI.40  and  Figure  VI.41.  The  simulated  steady-state  throughput  is 
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identical  to  the  theoretical  for  all  STM  cases  (they  are  always  optimum).  The  theoretical 
standard  interleaving  cases  are  characterized  by  a  decreasing  lower-bound  steady-state 
throughput  as  the  number  of  banks  increases.  Recall  that  this  lower  bound  is  based  on  the 
possibility  that  a  bank  will  receive  two  consecutive  requests  at  the  boundaries  of  the  base 
sequences  (refer  to  Chapter  V  Section  D  for  a  discussion  of  this  topic).  Figure  VI.40 
suggests  that  with  the  permutation  matrices  used,  this  is  not  the  case.  Further,  as  the 
number  of  banks  increases,  the  probability  decreases. 

Several  observations  can  be  made  concerning  the  theoretical  and  simulated 
maximum  latencies  shown  in  Figure  VI.41.  First,  the  basic  shape  of  the  theoretical  and 
simulated  maximum  latencies  are  similar  in  that  they  both  increase  with  the  number  of 
banks  and  are  relatively  invariant  to  the  number  of  cache  elements.  The  simulated 
maximum  latency  is,  however,  equal  to  the  theoretical  maximum  for  four-bank  memory 
with  a  substantial  differential  between  them  for  the  32-bank  memory.  This  is  due  to  the 
relative  length  of  the  input  vector  to  the  number  of  banks.  This  issue  will  be  explored 
more  fully  in  the  permutation-based  digit-reversed  experiment  below. 

Examples  of  two  simulation  runs,  the  first  with  four  banks  and  three  cache 
elements  and  the  second  with  thirty  two  banks  and  three  cache  elements  are  shown  in 
Figure  VI.46  and  Figure  VI.47,  respectively.  It  is  not  possible  to  anticipate  when  latency 
will  be  incurred.  For  a  small  number  of  banks,  it  is  more  likely  that  the  maximum  latency 
will  be  incurred  earlier  than  in  a  memory  configured  with  more  memory  banks  because 
the  likelihood  of  two  banks  being  close  is  greater  when  the  number  of  banks  is  small. 
Observe  the  distribution  of  the  latency  in  Figure  VI.46  versus  Figure  VI.47.  In  the  first 
plot  with  four  banks,  the  maximum  latency  is  incurred  early  in  the  run  in  contrast  to  the 
thirty  two  bank  simulation  where  the  maximum  latency  (in  the  plot)  is  not  obtained  until 
half  way  through  the  simulation  mn.  In  Figure  VI.48,  the  simulation  is  mn  with  an  input 
vector  of  1,024  and  it  can  be  seen  that  the  maximum  latency  is  not  reached  until 
approximately  cycle  600. 

The  tradeoff  between  latency  and  the  number  of  memory  banks  can  be  seen  in  the 
detailed  plots  by  the  amount  of  time  required  to  obtain  the  first  memory  response.  The 
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ratio  of  time  between  the  minimum  latency  and  the  length  of  the  input  vector  dictates  the 
best  average  throughput  (i.e.,  the  greater  the  ratio,  the  greater  the  penalty).  This  latency 
results  in  a  average  throughput  of  0.78,  as  shown  above  the  throughput  plot  in  Figure 
VI.47.  This  reflects  a  modest  increase  in  latency  from  the  conventional  decoding  case. 
Figure  VI.49  illustrates  the  effect  of  varying  the  number  of  banks  on  average  throughput 
for  the  case  of  stride=64.  The  penalty  of  a  larger  number  of  banks  is  clearly  shown  for 
STM  memories  when  the  vectors  are  relatively  small.  In  general,  STM  performance  is 
better  than  standard  interleaving  except  when  the  number  of  banks  is  32,  where  the 
performance  is  approximately  the  same. 

The  steady-state  throughput  and  average  throughput  for  strides  of  three  and  five 
are  illustrated  in  Figure  VI.50  and  Figure  VI.51.  Clearly  the  throughput  is  diminished 
from  the  powers  of  two  cases  shown  above.  The  steady-state  throughput  for  stride  of 
three  falls  steadily  as  the  number  of  banks  increases,  whereas  the  stride  of  five  case  has 
the  same  steady-state  throughput  for  four  and  32  banks  but  reduced  throughput  for  16 
banks.  These  figures  seem  to  confirm  the  erratic  behavior  of  permutation-based 
performance  for  strides  that  are  relatively  prime  to  strides  of  two.  The  average 
throughput  for  four  and  eight  bank  systems  suggest  moderate  performance  that  might  be 
tolerated  if  the  address  pattern  was  not  frequently  used. 
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Figure  VI.42  Comparison  of  Theoretical  Versus  Simulated  Steady-State 
Throughput  for  Stride=2  (Permutation-Based  Decoding) 
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Figure  VI.44  Comparison  of  Theoretical  Versus  Simulated  Steady-State 
Throughput  for  Stride=64  (Permutation-Based  Decoding) 
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STM  Status  Throughput  Latency 


Figure  VI.46  Detailed  Simulation  Run  for  Stride=64  STM(4,3,4)  (Permutation- 

Based  Decoding) 
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Plot  ID:  s64pb  #  Banks=32  # 


Simulated  Throughput 


Figure  VL49  Simulated  Average  Throughput  for  Stride=64  (Permutation-Based 

Decoding) 

This  experiment  demonstrates  that  permutation-based  decoding  can  provide 
favorable  performance  for  address  patterns  with  a  constant  stride  of  powers  of  two. 
Specifically,  the  steady-state  throughput  is  optimum  when  the  number  of  cache  elements 
is  at  least  three.  Further,  this  is  accomplished  with  a  modest  increase  in  the  latency  when 
the  vector  length  is  small  (e.g.,  128  in  the  examples).  For  larger  vector  lengths  where  the 
maximum  latency  is  realized,  the  increase  in  the  latency  is  approximately  doubled. 

The  next  section  will  address  radix-r  butterfly  address  patterns  when  conventional 
decoding  is  in  place. 
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3. 


Radix-r  Butterfly:  Conventional  Memory  Decoding 


A  comparison  of  the  theoretical  versus  simulated  steady-state  throughput  and 
maximum  latency  are  shown  in  Figure  VI.52  through  Figure  VL59  for  radices  two,  four, 
eight,  and  16.  The  simulated  steady-state  throughput  is  in  agreement  with  the  theoretical 
results  for  all  radices.  An  inspection  of  Figure  VI.52  reveals  that  all  STM  memories  yield 
a  throughput  of  1 .0  as  expected.  Standard  interleaving  cases  suffer  significant 
degradation  because  each  bank  is  given  consecutive  memory  requests  equal  to  the  radix. 
The  greater  the  number  of  banks,  the  more  banks  there  are  not  performing  as  indicated  for 
all  radices. 

The  simulated  maximum  latency  is  in  complete  agreement  for  the  radix-2  and 
radix-4  cases  as  shown  in  Figure  VI.53  and  Figure  VI.55,  respectively.  However, 
variances  occur  for  both  the  radix-8  (32  banks)  and  radix- 16  (16  and  32  banks)  cases.  In 
both  cases,  the  maximum  latency  rather  than  continuing  to  rise  as  the  expressions  would 
suggest,  flatten  out.  This  phenomena  is  due  to  the  relationship  between  the  “stride”  in 
effect  for  radix-r  patterns  and  the  expression  for  the  effective  number  of  banks. 

Whenever  the  effective  stride  becomes  smaller  than  the  number  of  banks,  then  the  latency 
will  be  reduced.  For  example,  for  the  radix- 16  case,  the  effective  stride  is 

r  t 

and  the  effective  number  of  banks  is 


B  -  ^ 

^  gcdiB,S)  gcd(16,8) 


=  2. 


Recall  that  B^^  under  most  circumstances  is  one  when  the  radix  and  number  of  banks  is  a 

power  of  two.  If  N  is  doubled,  the  expression  for  latency  used  to  constmct  the  theoretical 
plots  will  be  valid. 
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Figure  VI.54  Comparison  of  Theoretical  Versus  Simulated  Steady-State 
Throughput  for  Radix=4  (Conventional  Decoding) 


Figure  VI.55  Comparison  of  Theoretical  Versus  Simulated  Maximum  Latency  for 

Radix=4  (Conventional  Decoding) 
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One  last  issue  to  address  for  conventional  decoding  of  radix-r  butterfly  address 
patterns  is  the  average  throughput.  Average  throughput  of  a  radix-2  and  radix- 16  address 
patterns  are  displayed  in  Figure  VI. 60  and  Figure  VI.61 .  The  radix-2  butterfly  yields 
average  throughput  values  between  0.9  and  0.7,  approximately  five  to  ten  percent  lower 
than  the  constant  stride  patterns.  The  radix- 16  butterfly  average  throughput  is 
considerably  worse  beginning  at  0.7  with  a  lower  end  of  0.5.  The  lower  end  would,  of 
course,  be  worse  for  longer  vectors.  Therefore,  this  must  be  taken  into  account  when 
calculating  the  efficiency  of  the  vector  processor  or  when  determining  the  most  effective 
combination  of  radix  passes  to  use  for  a  given  length  vector. 

This  experiment  validates  that  conventional  decoding  coupled  with  STM  with  a 
sufficient  number  of  banks  will  provide  an  optimum  throughput,  but  at  higher  latencies 
than  encountered  with  either  conventional  or  permutation-based  decoding  of  constant 
strides.  The  following  section  will  investigate  permutation-based  decoding  of  radix-r 
address  patterns. 


Simulated  Throughput 


Figure  VI.60  Average  Throughput  for  Radix-2  Butterfly  Pattern  (Conventional 

Decoding) 
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Figure  VI.61  Average  Throughput  for  Radix-16  Butterfly  Pattern  (Conventional 

Decoding) 


4.  Radix-r  Butterfly:  Permutation-Based  Memory  Decoding 

The  theoretical  and  simulated  results  for  the  steady-state  throughput  and 
maximum  latency  are  illustrated  for  radices  two,  four,  eight,  and  16  in  Figure  VI. 62 
through  Figure  VI.71.  The  theoretical  steady-state  results  are  lower  bounds  for  the 
standard  interleaving  case.  The  theoretical  maximum  latency  is  an  upper  bound  for  all 
values.  The  simulation  runs  were  executed  with  the  tailored  permutation  matrices  for 
radices  four,  eight,  and  16.  Although  it  is  possible  to  develop  a  tailored  permutation 
matrix  for  radix  2,  radix  2  patterns  yield  good  performance  without  it.  Further,  there  are 
operational  constraints  that  make  it  desirable  not  to  have  a  specialized  permutation  matrix 
for  radix  2.  For  more  details,  see  the  conclusions  in  Chapter  VII. 

The  simulated  values  for  steady-state  throughput  and  maximum  latency  are  shown 
in  Figure  VI.62  and  Figure  VI.63,  respectively.  The  steady-state  throughput  is  1.0  for  all 
STM  simulations  with  three  or  more  cache  elements,  except  for  the  eight  bank  cases  with 
three  and  four  cache  elements  which  have  a  steady-state  throughput  of  0.89  and  0.96, 
respectively.  The  detailed  simulation  runs  for  eight  banks  with  three  and  four  cache 
elements  are  shown  in  Figure  VI.64  and  Figure  VI.65  respectively.  The  three-cache- 
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element-configured  simulation  reveals  the  GR  becoming  inactive  for  approximately  15 
cycles  indicating  an  insufficient  number  of  cache  elements.  The  additional  cache  element 
in  Figure  VI.65  eliminates  all  but  two  of  these  cycles.  The  radix-2  set  of  simulations 
represent  a  situation  where  a  few  more  cache  elements  may  be  useful  even  though  they 
are  not  needed  most  of  the  time.  Figure  VI.63  reveals  that  the  simulated  maximum 
latency  is  consistent  with  constant  stride  upper  bound  except  for  the  eight-bank  cases. 
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Figure  VL63  Comparison  of  Theoretical  Versus  Simulated  Maximum  Latency  for 


STM  Status  Throughput  Latency 


Figure  VI.64  Detail  Simulation  Run  for  Radix -2  STM(8,3,8)  (Permutation-Based 

Decoding) 
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The  steady-state  throughput  for  radix  four,  eight  and  16  shown  in  Figure  VI.66, 
Figure  VI.68,  and  Figure  VL70  reveal  an  ideal  steady-state  throughput  for  all  STM  cases 
where  the  number  of  cache  elements  is  three  or  four,  as  the  theory  predicts.  There  is 
substantial  degradation  for  most  standard  interleaving  cases.  A  two-cache-element  STM 
performs  better  for  a  larger  number  of  banks  and  poorly  for  four  bank  scenarios.  The 
maximum  latency  was  equal  to  the  theoretical  upper  bound  in  all  cases,  except  for  the  32- 
bank  configurations  for  radix  eight  and  16  simulations  as  shown  in  Figure  VI.67,  Figure 
VI.69,  and  Figure  VI.71. 

In  summary,  with  the  aid  of  permutation  matrices  tailored  to  the  stride  between  a 
radix  butterfly  operation,  the  resulting  pattern  has  the  features  of  a  constant  stride  pattern 
resulting  in  an  optimal  steady-state  throughput  when  at  least  three  cache  elements  are 
present.  Further,  the  maximum  latency  is  limited  to  approximately  twice  the  ideal  latency 
for  interleaved  memory  systems.  These  results  apply  to  radices  of  four,  eight,  and  16. 
Radix-2  butterfly  patterns  were  found  to  yield  good  performance  although  not  quite  as 
good  as  with  the  tailored  matrices  using  a  generic  constant  stride  permutation  matrix  as 
shown  in  Figure  HI.  13  through  Figure  III.  16. 

The  next  section  will  describe  the  performance  obtained  when  applying 
conventional  decoding  to  digit-reversed  address  patterns. 


193 


Figure  VI.69  Comparison  of  Theoretical  Versus  Simulated  Maximum  Latency  for 


Radix=8  (Permutation-Based  Decoding) 
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5.  Digit  Reversed:  Conventional  Memory  Decoding 

One  experiment  was  conducted  to  demonstrate  the  performance  when  using 
conventional  decoding  for  digit-reversed  address  patterns.  The  input  stream  was  a  digit- 
reversed  pattern  for  a  radix  of  two  with  ten  digits,  yielding  a  stride  of  2^  for  the  radix 
operation.  A  comparison  of  the  theoretical  versus  simulated  steady-state  throughput  is 
shown  in  Figure  VI.72.  The  theoretical  results  matches  the  simulated  results  perfectly.  A 
steady-state  throughput  is  obtained  that  is  the  reciprocal  of  the  number  of  banks  and  is 
independent  of  the  number  of  cache  elements. 

A  detailed  simulation  for  a  memory  with  four  banks  and  three  cache  elements  is 
shown  in  Figure  VI.73.  An  examination  of  the  STM  Status  plot  indicates  that  the  RE  line 
is  active  once  every  four  cycles  yielding  a  throughput  of  0.25.  Note  that  the  GR  signal  is 
active  for  a  short  period  of  time  allowing  the  one  active  bank’s  cache  elements  to  be  filled 
with  requests.  Thereafter,  the  RE  line  is  active  filling  the  one  available  cache  element 
followed  by  processing  time  for  a  memory  request  and  then  an  output  signaled  by  an 
active  RE  line.  This  pattern  is  repeated  until  the  simulation  is  completed.  Figure  VI.74 
contains  a  similar  plot  for  a  memory  system  with  32  banks.  The  primary  difference  is  that 
the  active  RE  lines  are  separated  by  32  cycles  rather  than  four  as  in  Figure  VI.73  because 
the  memory  ratios  are  matched  to  the  number  of  banks.  The  resulting  throughput  is  1/32 
or  approximately  0.03125  as  indicated  on  the  figure. 

This  result  presents  a  major  obstacle  to  the  architecture  described  in  Chapter  0 
because  one  such  pass  is  needed  for  each  FFT.  The  next  section  describes  the  results 
obtained  when  permutation-based  decoding  is  used  for  digit-reversed  address  patterns. 
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STM  Status  Throughput  Latency 


Plot  ID:  b2cn  #  Banks=32  #  CEs=3  Mem  Ratio=32 


S=0.9981  AvgTP=0.03119  MaxL=127  AvgL=126.8  StdL=3.618 


SSTP=0.03161  TR=1019 


Time  (Cycles) 


Figure  VL74  Detail  Simulation  Run  for  Radix=2  /  NoDigits=10  STM(32,3,32) 

(Conventional  Decoding) 


6.  Digit  Reversed:  Permutation-Based  Memory  Decoding 

As  indicated  in  Chapter  V,  the  digit  reversal  pattern  should  be  equivalent  to 
constant-stride  address  performance  if  the  radix  is  equal  to  or  greater  than  the  number  of 
banks.  In  the  case  of  radix-2  with  ten  digit  simulation,  the  condition  is  not  met  for  any  of 
the  simulations.  In  spite  of  this,  the  performance  in  this  instance  is  almost  perfect  as  can 
be  seen  by  viewing  the  steady-state  throughput  and  the  maximum  latencies  contained  in 
Figure  VI.75  and  Figure  VI.76. 

Figure  VI.77,  Figure  VI.78,  and  Figure  VI.79  contain  the  steady-state  throughput 
plots  for  radix-4,  8  and  16  respectively.  For  each  plot,  when  the  radix  is  equal  to  or 
greater  than  the  number  of  banks,  an  optimal  steady-state  throughput  is  obtained,  as 
predicted  in  Chapter  V  (i.e.,  for  four  banks  in  Figure  VI.77,  four  and  eight  banks  in 
Figure  VI.78,  and  four,  eight,  and  sixteen  banks  in  Figure  VI.79).  When  the  condition  is 
not  met,  good  performance  is  sometimes  obtained  anyway  (e.g.,  16  banks  in  Figure 
VI.77).  In  some  instances  poor  performance  is  improved  substantially  by  adding  another 
cache  element  (e.g.,  eight  banks  in  Figure  VI.77  and  32  banks  in  Figure  VI.78). 

In  summary,  the  permutation-based  digit-reversed  simulation  results  confirmed 
the  analysis  described  in  Chapter  V.  In  particular,  when  the  radix  is  equal  to  or  greater 
than  the  number  of  banks,  performance  is  consistent  with  constant-stride  address  patterns. 
When  it  is  not,  the  performance  is  mixed  but  over  all  it  provides  performance  that  may  be 
acceptable,  given  this  address  pass  is  required  only  once  for  each  FFT. 
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Figure  VI.75  Comparison  of  Theoretical  Versus  Simulated  Steady-State 
Throughput  for  Radix=2  /  NoDigits=10  (Permutation-Based  Decoding) 


c. 


GENERAL-PURPOSE  COMPUTING  EXPERIMENT 


The  speedup,  throughput,  and  latency  plots  for  the  general-purpose  computer 
experiment  are  shown  in  Figure  VL80  through  Figure  VL82.  Adding  cache  elements 
increases  both  the  speedup  and  throughput.  However,  this  is  accompanied  by  much 
larger  latencies  for  the  simulation  runs  with  a  larger  number  of  banks.  For  four  banks, 
speedup  increases  by  382  percent  from  standard  interleaving  simulation  to  the  STM 
simulation  with  64  cache  elements.  However,  294  percent  of  this  improvement  was 
obtained  when  the  number  of  cache  elements  was  increased  to  only  four.  The  64-bank 
simulations  recorded  a  similar  trend  with  406  percent  total  improvement  and  241  percent 
obtained  with  four  cache  elements  from  the  standard  interleaving  case. 

Notice  that  although  the  speedup  continues  to  improve  when  cache  elements  and 
the  number  of  banks  are  increased,  the  throughput  actually  falls  as  the  number  of  banks 
increases.  This  is  because  the  memory  ratio  is  matched  to  the  number  of  banks  and  the 
difficulty  of  the  problem  increases  proportionally  as  the  number  of  banks  increases. 

Recall  the  relationship  between  speedup,  throughput,  and  the  memory  ratio  as  shown  in 
Equation  (II.7). 

A  comparison  of  the  standard  interleaving  case  to  the  analytical  results  is  shown 
in  Table  VI.5.  Although  the  simulated  results  correlate  with  the  analytic  results,  there  is  a 
constant  bias  of  approximately  one  for  each  value. 

In  summary,  the  general-purpose  computing  simulation  suggests  that  speedup  and 
throughput  are  enhanced  by  adding  cache  elements.  Diminishing  marginal  returns  is 
observed  when  only  a  few  cache  elements  are  added  to  the  standard  interleaving  case. 


Number  of 
Banks 

RO-56 

Simulated 

Results 

4 

2.2 

1.2 

8 

3.2 

2.2 

16 

4.7 

3.7 

32 

7.0 

5.8 

Table  VL5  Comparison  of  Analytic 
Versus  Simulated  Speedup 
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VII.  CONCLUSIONS 

This  section  first  summarizes  the  design  decisions  for  the  Butterfly  Machine 
(BFM)  Architecture  followed  by  a  design  methodology  for  constructing  this  type  of 
computer.  The  last  section  contains  additional  conclusions  concerning  this  effort. 

A.  DESIGN  DECISIONS 

The  computational  complexity  of  cyclostationary  processing,  combined  with  the 
requirement  to  design  to  a  factor  of  real  time  Fj  and  sample  interval  ,  drives  the  need 
for  a  scaleable  number  of  processors  in  the  architecture.  The  BFM  Architecture  is  based 
on  pipelined  vector  processing  techniques  because  it  yields  an  efficient  implementation 
for  FFTs  in  particular,  and  vector  operations  in  general. 

Radix-2*  algorithms  were  selected  based  on  the  availability  of  efficient 
implementations  in  hardware  and  their  widespread  use  and  popularity.  The  radix  values 
supported  are  two,  four,  eight,  and  sixteen. 

The  number  of  memory  banks  allowed  in  the  architecture  is  a  power  of  two.  This 
constraint  simplifies  the  bank  number  selection  hardware  although  2*  ±  1  bank 
architectures  are  almost  competitive. 

The  use  of  radix  values  and  number  of  banks  that  are  both  powers  of  two  require 
an  alternative  to  conventional  bank  number  decoding.  Properly  designed  permutation- 
based  bank  decoding  provides  an  efficient  utilization  of  the  memory. 

Another  design  decision  is  programmable  permutation  matrices.  This  provides 
for  flexibility  in  general.  The  primary  motivation  is  to  enhance  performance  by  allowing 
radix-r  specific  matrices.  This  design  decision  is  closely  related  to  two  additional  design 
decisions: 

•  The  decision  not  to  use  a  specialized  permutation  matrix  for  the  radix-2 
butterfly. 

•  The  decision  to  require  that  all  specialized  radix-r  butterfly  matrices  also 
support  constant  strides  of  powers  of  two. 
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The  computation  of  an  FFT  on  an  input  vector  of  length  2*  requires  that  a 
decision  be  made  concerning  the  size  of  the  radices  and  order  in  which  the  different  radix 
operations  are  applied.  The  strategy  taken  for  this  architecture  is  to  rewrite  the  length  of 
the  input  vector  as 

2^  =R'^R  (VII.1) 

where  R  is  the  largest  valid  radix  for  the  length  of  the  input  vector.  The  largest  value  of 
m  is  chosen  such  that 


2^  >  and 
R>R. 


(VIL2) 


Therefore,  m  radix-i?  butterfly  passes  will  be  made  on  the  input  vector  followed  by  at 
most  one  radix  -  R  butterfly  pass. 

An  inspection  of  Figure  III.9  reveals  that  the  memory  that  initially  holds  the  input 
vector  must  be  accessed  by  an  address  pattern  of  constant  stride  of  one  followed  by  a 
radix-i?  pattern.  Therefore,  this  memory  will  be  loaded  with  the  appropriate  radix-i? 
permutation  matrix.  Each  additional  pass  is  characterized  by  a  memory  that  will  accept  a 
set  of  inputs  with  a  constant  stride  of  one  followed  by  read  operation  with  a  radix-r 
butterfly  address  pattern.  The  corresponding  memory  will  use  the  appropriate  radix-r 
permutation  matrix.  The  appropriate  matrix  is  the  radix-i?  permutation  matrix  for  all 
passes  with  the  possible  exception  of  the  last  pass  which  will  use  the  radix-i? 
permutation  matrix  if  i?  exists  for  the  decomposition  of  Equation  (VII.l). 

The  right-most  memory  in  Figure  (111.9)  is  written  into  with  a  constant  stride  of 
one.  The  data  is  read  out  with  a  digit-reversed  pattern.  In  those  cases  where  the  vector 
length  is  such  that  the  decomposition  of  Equation  (VII.l)  does  not  contain  the  factor  i? , 
then  the  required  addressing  pattern  is  a  digit-reversed  address  pattern.  If  on  the  other 
hand,  there  is  a  factor  of  R  in  Equation  (VII.l),  then  the  address  pattern  is  not  strictly 
digit-reversed.  However,  the  address  pattern  does  have  a  characteristic  of  a  constant 
stride  of  a  power  of  two.  In  either  case,  if  the  radix  is  greater  than  the  number  of  banks, 
then  the  performance  is  near  optimum. 
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When  the  radix  is  less  than  the  number  of  bank,  the  simulation  results  suggest  that 
the  steady-state  throughput  is  near  optimal  when  four  cache  elements  are  used  for  those 
cases  examined  in  Chapter  VI  (See  Figure  VL77  and  Figure  VI.78).  These  address 
patterns  must  be  simulated  in  any  final  design  to  verify  performance  and  to  adjust  the 
permutation  matrices  if  the  performance  is  not  acceptable. 

The  permutation  matrices  used  for  the  digit-reversed  pattern  experiments  were  the 
constant  stride  powers  of  two  matrices.  A  future  research  topic  is  to  determine  whether  a 
tailored  permutation  matrix  can  be  found  for  the  digit  reversed  case. 

B.  STM  DESIGN  METHODOLOGY 

The  design  methodology  for  developing  an  STM  memory  begins  with  the 
processor  and  bulk  store  memory  cycle  times  desired  for  the  architecture.  The  ratio  of  the 
bulk  store  cycle  time  to  the  processor  cycle  time  is  the  memory  ratio,  one  of  the  three 
parameters  necessary  for  STM  memory.  The  memory  ratio  can  be  expressed  as 

MR  = 

where 

Tbs  is  the  cycle  time  for  the  bulk  store,  and 
is  the  cycle  time  for  the  processor. 

The  ceiling  function  must  be  taken  on  the  bulk  store  /  processor  cycle  time  ratio  to  yield 
an  integer  value  that  will  permit  the  memory  to  process  a  memory  request. 

The  memory  ratio  dictates  the  number  of  banks  required  for  the  memory  system. 
The  number  of  banks  is  required  to  be  a  power  of  two  in  this  design  for  simplicity  of  bank 
selection  and  must  be  greater  than  or  equal  to  the  memory  ratio. 

The  results  in  Chapter  VI  suggest  that  overall  performance  is  constrained  by  the 
maximum  latency  and  the  maximum  latency  is  approximately  twice  the  memory  ratio 
when  permutation  matrices  are  utilized  for  bank  decoding.  Since  the  memory  ratio  is  tied 
directly  to  the  number  of  banks,  there  is  motivation  to  minimize  the  number  of  banks. 


Pr 


(VII.3) 
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Once  the  number  of  banks  has  been  selected,  the  last  parameter  to  fix  is  the 
number  of  cache  elements.  Based  on  the  results  of  Chapter  VI,  the  number  of  cache 
elements  is  likely  to  be  not  less  than  four.  However,  cache  elements  are  relatively 
inexpensive,  assuming  that  they  are  implemented  with  very  large  scale  integration.  The 
actual  number  chosen  is  likely  the  largest  number  possible  within  the  economic  bounds  of 
the  fabrication  process. 

Programmable  permutation  matrices  allow  the  incorporation  of  performance 
enhancements  when  more  advanced  permutation  matrices  are  discovered.  In  some 
circumstances,  the  performance  of  these  matrices  may  be  dependent  upon  more  cache 
elements  than  was  previously  required. 

The  last  step  of  the  design  process  it  to  construct  permutation  matrices  for  the 
architecture.  Although  there  are  many  possible  addressing  patterns,  there  are  a  relatively 
small  number  when  compared  to  general-purpose  computing.  All,  or  a  selected  set,  can 
be  simulated  to  verify  the  anticipated  performance.  The  number  of  cache  elements  can  be 
varied  for  sensitivity  analysis.  Permutation  matrices  may  also  be  fine  tuned  to  improve 
performance. 

C.  GENERAL  CONCLUSIONS 

The  preceding  chapters  describe  a  pipelined  vector  computer  architecture 
designed  to  compute  fast  Fourier  transforms  (FFTs)  efficiently.  Other  vector  processing 
operations  such  as  vector  multiplication  are  also  well  suited  for  this  architecture.  Use  of 
the  constant  geometry  radix  butterfly  organization  is  a  key  design  decision  providing 
simplification  in  the  address  stream  generation  for  radix-r  passes. 

The  memory  system  is  the  key  component  of  a  vector  processor  architecture. 
Addressing  stream  characteristics  for  general-purpose  and  vector  processors  are  described 
in  Chapter  n.  Banked  interleaved  memory  remains  the  technique  of  choice  for  vector 
processors  because  of  the  high-performance  requirements  and  the  promise  of  exploiting 
the  constant-stride  address  stream  characteristic.  This  architecture  is  based  on  the 
requirement  that  data  be  fed  into  a  vector  processor  at  the  rate  of  one  data  element  per 
clock  cycle  for  each  vector.  The  constant-stride  address  stream  characteristic  is  exploited 
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through  the  use  of  specially  designed  permutation  matrices  used  for  bank  number 
decoding. 

The  performance  of  STM  memories  using  both  conventional  and  permutation- 
based  matrices  was  analyzed  in  Chapter  V  and  Chapter  VI.  The  preferred  bank  decoding 
scheme  was  with  permutation  matrices,  based  on  performance.  The  results  of  Chapter  VI 
indicate  that  optimum  steady-state  throughput  is  possible  in  all  cases  for  constant-stride 
address  patterns  with  a  stride  that  is  a  power  of  two,  as  well  as  for  radix-r  butterfly 
patterns  using  tailored  permutation  matrices.  In  fact,  both  of  these  cases  yield  an  upper 
bound  of  twice  the  memory  ratio  plus  one.  This  is  excellent  given  that  the  minimum 
latency  for  any  interleaved  system  is  the  memory  ratio  plus  two!  The  other  address 
pattern,  digit-reversed  addressing,  also  yields  the  same  performance  as  indicated  above 
for  constant-stride  and  radix-r  butterfly  addressing  when  the  radix  is  greater  than  or  equal 
to  the  number  of  banks.  When  it  is  not,  the  actual  performance  is  in  some  instances 
similar  to  that  noted  above,  and  in  others  is  somewhat  less.  These  cases  need  to  be 
simulated  to  determine  the  specific  performance  characteristics.  One  possible  area  of 
study  is  to  determine  whether  permutation  matrices  can  be  designed  specifically  for  digit- 
reversed  patterns  and  still  retain  their  suitability  for  constant  stride  and  radix-r  address 
patterns. 

The  following  is  a  list  of  further  conclusions  concerning  the  butterfly  machine 
architecture  and  the  STM  memory  described  previously: 

•  The  use  of  BFMs  provide  a  practical  method  for  reducing  the  clock  time 
needed  for  cyclostationary  computing.  The  amount  of  reduction  is  variable 
and  is  determined  by  the  degree  to  which  parallelism  is  exploited. 

•  The  use  of  BFMs  is  scaleable  over  a  substantial  processing  range  and  is 
limited  by  the  number  of  backplane  slots  supported  by  the  host.  An 
architecture  using  a  single  BFM  chip  is  first  described  which  provides  a 
baseline  capability.  Due  to  the  parallelism  inherent  in  many  cyclostationary 
algorithms,  a  natural  extension  is  to  develop  an  architecture  that  incorporates 


217 


multiple  copies  of  the  one-chip  architecture  and  connect  them  with  dedicated 
high-speed  data  busses  for  data  sharing. 

•  The  BFM  architecture  requires  large  quantities  of  memory.  This  memory 
requirement  can  be  accommodated  using  relatively  slow  low-cost  bulk 
memory  devices.  In  this  investigation,  each  addressing  stream  had  a 
dedicated  memory.  One  area  of  future  study  is  to  determine  if  it  is  more 
effective  to  constmct  fewer  larger  memories  than  the  configuration  shown  in 
Figure  HI.  17. 

•  A  good  design  requires  that  the  number  of  banks  be  greater  than  or  equal  to 
the  memory  ratio.  With  the  appropriate  permutation  matrix,  the  number  of 
banks  need  not  be  greater  than  the  memory  ratio.  Further,  the  number  of 
cache  elements  can  be  limited  to  approximately  four  in  most  circumstances. 

•  STM  is  an  effective  technique  for  using  relatively  slow,  inexpensive,  bulk 
storage  with  the  BFM  architecture  when  the  array  lengths  are  large,  relative  to 
the  latency. 
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Appendix  Matlab^  Source  Code  for  STM  Simulator 


%  File  Name:  stm.m 

%  Description:  Top  level  driver  for  split  transaction  memory 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  27  Oct  95 

% 

%  Comments: 

%  3/08:  Empty  modified  to  be  a  SF  variable  rather 

%  than  a  variable! 

%  3/13:  Slight  cleanup  of  comments 

%  3/1 6:  Add  event  flags  to  catch  activity  to/from 

%  DRAM  as  is  done  to/from  CPU. 

%  3/26:  Modify  to  accept  only  an  address.  Bank  # 

%  is  computed  in  gen_addr() 

%  4/14:  Modify  to  measure  latency  from  the  point  of 

%  of  issue  by  the  processor 

%  4/22:  Modify  to  allow  both  ASCII  and  binary  output 

%  4/23:  Performance  enhancements  (init_rec) 

%  10/27:  Add  PB  bank  selection 

% 

%  function  []  = 

%  stm{Fname, ASCII, Level, AList,NoBanks,NoCE,MemRatio,MemDecode, A) 
%  where 


% 

Fname 

File  name  for  saved  data 

% 

ASCII 

Determines  the  format  of  the  output  file 

% 

Level 

Determines  the  level  of  detail  of  ouput  saved  in 

% 

Fname. 

% 

% 

% 

% 

AList 

Address  List.  This  is  a  matrix.  Each  row 
is  of  the  form:  [Address  Bank#  RW] 

NoBanks 

Number  of  banks  to  be  used  in  the  simulation 

% 

NoCE 

Number  of  Cache  Elements  to  be  used  in  the 

% 

simulation 

% 

Mem  Ratio 

Ratio  of  Dynamic  to  Static  memory  cycle  time 

% 

MemDecode 

% 

0  -  Conventional  decoding 

% 

1  -  PB  decoding  using  matrix  A 

% 

A 

PB  decoding  matrix 
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function  []  = ... 

stm(Fname, ASCII, Level, AList,NoBanks,NoCE,MemRatio,MemDecode, A) 

%  Check  input  arguments 
if  ((MemDecode==0)  &  ((nargin<8)l(nargin>9)) ), 
fprintf(1, 'Input  Farm  Error  1\n’); 
exit(-1); 

end; 

if  ((MemDecode==1)  &  (nargin~=9)), 
fprintf(1, 'Input  Farm  Error  2\n'); 
exit(-1); 

end; 

if  ((NoCE<1)  I  (MemDecode<0)  I  (MemDecode>1)  I  (NoBanks<1)  I  ... 
(LevekO)  I  (Level>2)  I  (ASCIkO)  I  (ASCII>1)  I  ... 

((Level==2)&(ASCII==0)) ) , 

fprintf(1,'lnput  Farm  Error:  3\n'); 
exit(-1); 

end; 

if  (MemDecode==1), 

ADim  =  size(A); 
if  (2^ADim(1)~=NoBanks), 

fprintf(1, 'Input  Farm  Error:  4\n'); 
exit(-1): 

end; 

clear  ADim 

end; 

%  If  Fermutation  based  decoding  is  chosen,  permute  the  addresses 
%  using  the  A  matrix 
if  (MemDecode==1), 

Addr  =  AList; 

[ResultVect,NoDigits]  =  pb_int(Addr,  A,  0); 

AList(:,1)  =  ResuItVect'; 

end; 

%%%  Farameter  initialization  %%% 

%  Simulation  Farameters 
SysClk  =  1 ; 

Curind  =  1; 

%%%  These  variables  are  used  for  data  collection  %%% 

MemResp  =  zeros(1 ,2);  %  1  st  variable  is  Boolean 

%  1 -response  occu red; 

%  0-response  did  not  occur. 

%  2nd  variable  indicates  Bank  responding 
ReqAllowed  =  zeros(1 ,3);  %  1  st  variable  is  Boolean. 

%  1 -request  was  allowed; 

%  0-request  was  not  allowed. 
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%  2nd  variable  indicates  Bank  responding 
%  3rd  variable  indicates  address 


O/  O/  O/  O/  O/  O/  O/  O/  O/  O/  O/  O/  O/  O/  O/  0/  O/  0/  O/  O/  0/  0/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/ 
/o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o 


LastAddr  =  0; 

%%%  Key  Parameters  %%% 
%NoBanks 


%NoCE 
%MemRatio 
ReqCount  =  MemRatio; 

%%%  NoCE  Adjustment  for  effective  NoCE  %%% 

NoCE  =  NoCE  +  1 ; 

%%%  Bank  Variables  %%% 

%  Note  that  each  variable  is  two  dimensional;  the  first  variable  is  used 
%  to  specify  an  element  within  an  array  (e.g.,  Cache  variables).  The 
%  second  index  is  used  to  specify  the  bank  number. 

% 


%%%  Cache  Array  Elements  %%% 
Index  =  zeros(NoCE,NoBanks); 
IndexN  =  Index; 

Address  =  zeros(NoCE,NoBanks); 
AddressN  =  Address; 

RW  =  zeros(NoCE,NoBanks); 
RWN=RW; 

Ready  =  zeros(NoCE,NoBanks); 
ReadyN=Ready; 

Data  =  zeros(NoCE,NoBanks); 
DataN=Data; 


%%%  Counters  %%% 

NAC  =  ones{NoBanks,1); 
NACN=NAC; 

CPC  =  ones(NoBanks,1); 
CPCN=CPC; 

OC  =  ones(NoBanks,1); 
0CN=0C; 

□Count  =  zeros(NoBanks,1); 
DCountN=DCount; 


%%%  Flags  %%% 

Empty  =  ones(NoBanks,1); 
PDC  =  zeros(NoBanks,1); 
PDCN=PDC; 
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%%%  Signals  %%% 

GRI  =  ones(NoBanks,1);  %  Initially  all  TRUE 
GR  =  0: 

REI  =  zeros(NoBanks,1);  %  Initially  all  FALSE 
RE  =  0; 

BS  =  zeros(NoBanks,1): 

%%%  Global  Counters  %%% 

ReqC  =  zeros(NoBanks,1): 

ReqCN=ReqC: 

ResC  =  zeros(NoBanks,1); 

ResCN=ResC; 

ODataLen  =  length(AList)*2; 

OData  =  zeros(ODataLen,9); 


0/  o/  0/  o/  Of  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/  o/ 
/o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o 

%%%%%%%%%%%%  Program  Begins  Here  %%%%%%%%%%% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
%  Initialize  Save  File 


if  ASCII, 

ficl=init_sf(Fname,NoBanks,NoCE,MemRatio, Level); 

end; 

done  =  0; 

fprintf(1 , 'Simulation  Begins  ...  \n'); 
fprintf(1,'#  Banks:  %d\n',NoBanks); 
fprintf(1  ,'#  Cache  Elements:  %d\n',NoCE-1 ); 
fprintf(1,'#  Memory  Ratio:  %d\n',MemRatio); 
fprintf(1  ,'#  Memory  References:  %d\n',length(AList)); 

NIog  =  log10(length(AList)); 
if  (NIog  <=2), 

DelMark  =  1 ; 

fprintf(1  ,'Each  tic  is  1  cycle\n\n'); 
elseif  (NIog  <=3) 

DelMark  =  10; 

fprintf(1  ,'Each  tic  is  10  cycles\n\n‘); 

else 

DelMark  =  100; 

fprintf(1  ,'Each  tic  is  100  cycles\n\n’); 

end; 

Mark  =  1 ; 
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while  -done, 

if  (Mark  >=  DelMark), 
fprintf(1,'.'); 

Mark  =  1 ; 

else 

Mark  =  Mark  +  1 ; 

end; 

if  (rem(SysClk,50*DelMark)==0),  fprintf(1,'\n'); 
end; 

GRI  =  eval_gri(NoBanks,NAC,OC, Empty, NoCE); 

REI  =  evaLrei(REI,ResC, Index, OC,Ready); 

Empty  =  ev_empty(NAC,OC,CPC,NoBanks); 

[Addr,BankSelNo,WRFIag,LastAddr,Curlnd] ... 

=  gen_addr(Curlnd,LastAddr,AList,NoBanks,GRI); 

%  Initialize  recording  variables  for  a  time  slice 
[MemResp  ReqAllowed  DRAMResp  DRAMIssued] ... 

=  init_rec(NoBanks,Addr); 
for  BankNo  =  1:NoBanks, 

%%%  Respond  to  Memory  Read  %%% 

[OCN,ResCN, MemResp, OutData]= ... 

mem_resp(lndex,RW,Ready,Data,NAC,CPC,  ... 

OC, REI, ResC,MemResp, BankNo, NoCE,  ... 
OCN,ResCN); 

%%%  Service  Dynamic  Memory  %%% 
[ReadyN,DataN,CPCN,DCountN,PDCN,DRAMResp,DRAMIssued]= 
ser_dmem(Address,RW, Ready, Data,NAC,CPC,OC, ... 
DCount,PDC,BankNo,ReqCount,NoCE, ... 
ReadyN,DataN,CPCN,DCountN,PDCN,  ... 

DRAMResp, DRAMIssued); 

%%%  Service  Memory  Request  %%% 
if  BankSeINo  >=0, 

[lndexN,AddressN,RWN,ReadyN,DataN,NACN,ReqCN,ReqAllowed]= 
ser_memr(lndex,Address,RW,Ready,Data,NAC,CPC,OC,GRI,... 
BS.ReqC, ReqAllowed, BankNo, Addr,BankSelNo,WRFIag, NoCE, ... 
lndexN,AddressN,RWN,ReadyN,DataN,NACN,ReqCN); 
end;  %  if  BankSeINo 
end;  %for 

lndex=lndexN;  Address=AddressN;  RW=RWN; 

Data=DataN;  Ready=ReadyN; 

NAC=NACN;  CPC=CPCN;  OC=OCN;  DCount=DCountN; 

PDC=PDCN;  ReqC=ReqCN;  ResC=ResCN; 
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%  Evaluate  the  SFs  in  order  to  record  the  values  that  exist 
%  during  the  cycle.  It  also  causes  the  simulation  to 
%  complete  at  the  correct  time. 

GRI  =  eval_gri(NoBanks,NAC,OC, Empty, NoCE); 

REI  =  evaLrei(REI,ResC, Index, OC, Ready); 

Empty  =  ev_empty(NAC,OC,CPC,NoBanks): 
done  =  sim_comp(LastAddr,Empty); 

%  Save  Results 
if  ASCII, 

sav_res(lndex,Address,RW,Ready,Data,NAC,CPC, ... 

OC,DCount,Empty,PDC,GRI,REI,BS,ReqC,ResC,SysClk 

NoBanks,BankSelNo,WRFIag,NoCE,fid,Level,MemResp, 

ReqAllowed,DRAMResp,DRAMIssued,MemRatio); 

else, 

if  ReqAllowed(l); 

ADDR  =  Address(modulo1(NAC(ReqAllowed(2))-1,NoCE), ... 
ReqAllowed(2)); 

else 

ADDR  =  -1; 

end; 

if  MemResp(l); 

ADDR2  =  Address(modulo1(OC(MemResp(2))-1,NoCE),MemResp(2)) 
else 

ADDR2  =  -1; 

end; 

Epoch  =  [SysClk  BankSeINo  WRFlag  ReqAllowed(l) ... 

ReqAllowed(3)  ADDR  MemResp(l)  ADDR2  MemResp(2)]; 
OData(SysClk,1:9)  =  Epoch; 
if  ODataLen==SysClk, 

ODataLen  =  ODataLen*2; 

TData  =  OData; 

OData  =  zeros(ODataLen,9); 

OData(1  :SysClk,1  ;9)  =  TData; 

end; 

end; 

SysClk  =  SysClk  +1; 
end;  %while 


228 


if  ASCII, 

fclose(fid): 

else 

OData  =  OData(1  :SysClk-1 ,1:9); 
fname  =  [Fname,  ’.gri’]; 
fid  =  fopen(fname,'w‘): 

Tmp  =  [NoBanks  NoCE  MemRatio]; 

fwrite(fid,Tmp,'long'); 

fclose(fid); 

fname  =  [Fname,  '.gr2']; 
fid  =  fopen(fname,'w'); 
fwrite(fid,OData,'long'); 
fclose(fid); 

end; 

fprintf(1  ,‘\nTotal  Number  of  Cycles=  %d\n\n',SysClk-1); 
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%  File  Name:  ev_empty.m 

%  Description:  Evaluate  the  status  of  Empty  flags 

%  Internal  flag  within  all  banks. 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  07  Mar  95 

% 

%  function  Empty  =  ev_empty(Empty,NAC,OC,CPC) 

% 

%  where 


% 

% 

Empty 

Empty  flag 

% 

NAC 

Next  Available  Counter 

% 

OC 

Output  Counter 

% 

CPC 

Current  Processed  Counter 

% 

function  Empty  =  ev_empty(NAC,OC,CPC,NoBanks) 


for  i=1:NoBanks, 

Empty(i)  =  (NAC(i)==CPC(i))  &  (CPC(i)==OC(i)); 

end; 
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%  File  Name:  evaLgr.m 

%  Description:  Evaluate  the  status  of  the  Grant  Request 

%  control  line  based  on  the  values  of  the  Grant  Request 

%  Internal  controls  within  each  CE. 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  6  Feb  95 

% 

%  function  status  =  evaLgr(GRI) 

%  where 

%  status  TRUE  if  MR  active;  FALSE  otherwise 

%  GRI  Grant  Request  Internal 

% 

function  status  =  evaLgr(GRI) 

if  min(GRI)==0, 
status  =  0; 

else 

status  =  1 : 

end; 
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%  File  Name:  evaLgri.m 

%  Description:  Evaluate  the  status  of  the  Grant  Request 

%  Internal  signal  within  all  banks. 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  22  Feb  95 

% 

%  function  GRI  =  eval_gri(GRI,NAC,OC, Empty) 

%  where 


% 

GRI 

Grant  Request  Internal  lines  for  the  memory  banks. 

% 

1  -  indicates  that  bank  is  available; 

% 

0  -  indicates  that  bank  is  unavailable. 

% 

NAG 

Next  Available  Counter 

% 

OC 

Output  Counter 

% 

% 

Empty 

Empty  flag 

function  GRI  =  eval_gri(NoBanks,NAC,OC, Empty, NoCE) 
for  i=1:NoBanks, 

GRI(i)  =  (modulo1(NAC(i)+1,NoCE)~=OC(i))  I  (Empty(i)==1); 

end; 
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%  File  Name:  eval_rei.m 

%  Description:  Evaluate  the  status  of  the  Request  Enable 

%  Internal  signal  within  all  banks. 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  21  Feb  95 

% 

%  function  REI  =  evaLrei(REI,ResC, Index, OC, Ready); 

%  where 


% 

REI 

Request  Enable  Internal  lines  for  the  memory  banks. 

% 

1  -  indicates  bank  has  data  available; 

% 

0  -  indicates  bank  doesn't  have  data  available. 

% 

ResC 

Response  Counter 

% 

Index 

Processing  Index 

% 

OC 

Output  Counter 

% 

% 

Ready 

Ready  flag 

function  REI  =  evaLrei(REI,ResC, Index, OC, Ready) 

N  =  length(REI); 
fori=1:N, 

REI(i)  =  (ResC(i)==lndex(OC(i),i))  &  Ready(OC(i),i); 

end; 
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%  File  Name: 

%  Description: 

% 

% 

% 

%  Programmer: 
%  Date  Mod: 


% 


init_rec.m 

Initializes  the  recording  variables  prior  to  servicing 
a  bank.  The  recording  variables,  MemResp,  ReqAllowed, 
DRAMResp,  and  DRAMIssued  are  used  to  record 
simulation  events  and  are  not  a  part  of  the  simulation. 
Raymond  F.  Bernstein  Jr. 

23  Jun  95  Expanded  ReqAllowed  to  include  address 


%  Evaluate  GR  (Grant  Request) 

%  function  [MemResp,  ReqAllowed,  DRAMResp,  DRAMIssued]  = 
%  init_rec(NoBanks) 

%  where 


% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 

% 


% 

% 


MemResp  Two  field  variable  used  to  record  the  memory  response 

First  Field:  Boolean  indicating  whether  a  memory  response 
occurred 

Second  field:  Bank  number  of  the  responding  field 
ReqAllowed  Three  field  variable  used  to  record  whether  a  memory 
request  was  permitted 

First  Field:  Boolean  indicating  whether  a  memory  request 
occurred 

Second  field:  Bank  number  of  the  responding  field 
Third  Field:  Memory  Address 

DRAMResp  Boolean  indicating  the  bulk  store  responded  in  the  cycle 
DRAMIssued  Boolean  indicating  the  bulk  store  was  issued  during  the  % 
cycle 

NoBanks  Number  of  banks  for  the  memory  to  be  simulated 


function  [MemResp,  ReqAllowed,  DRAMResp,  DRAMIssued]  = 

init_rec(NoBanks,Addr) 


MemResp  =  [0  -1]; 

ReqAllowed  =  [0  -1  Addr]; 

DRAMResp  =  zeros(2,NoBanks); 

DRAMResp(1 ,1  :NoBanks)  =  zeros(1 , NoBanks); 
DRAMResp(2,1:NoBanks)  =  -1*ones(1, NoBanks); 

DRAMIssued  =  zeros(3,NoBanks); 

DRAMIssued(1 ,1  :NoBanks)  =  zeros(1 , NoBanks); 
DRAMIssued(2,1:NoBanks)  =  -1*ones(1, NoBanks); 
DRAMIssued(3,1:NoBanks)  =  -1*ones(1, NoBanks); 
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%  init_sf.m 
%  Initialize  Save  File 

%  function  fid=init_sf(fname,NoBanks,NoCE,MemRatio, Level) 

% 

%  where 
%  fid 

%  fname 

%  NoBanks 

%  NoCE 

%  Mem  Ratio 

%  Level 

% 

function  fid  =  init_sf(fname, NoBanks, NoCE, MemRatio, Level) 

fname  =  [fname,  '.gr']; 
fid  =  fopen(fname,'wt'); 


File  id  of  the  opened  save  file 

File  name  for  saved  data 

Number  of  Memory  Banks  in  the  simulation 

Number  of  Cache  Elements  in  the  simulation 

Ratio  of  dynamic  to  static  memory  cycle 

Level  of  detail  to  save  for  analysis 


if  Level==0; 

fprintf(fid, 'Number  of  Banks:  %s  ',num2str(NoBanks)): 
fprintf(fid, 'Number  of  Cache  Elements:  %s\n',num2str(NoCE-1)); 
fprintf (fid, 'Dynamic/Static  Mem  Cycle  Time:  %s\n\n',... 

num2str(MemRatio)): 

elseif  Level==1 ; 

fprintf(fid, 'Number  of  Banks:  %s\n',num2str(NoBanks)); 
fprintf(fid, 'Number  of  Cache  Elements:  %s\n',num2str(NoCE-1)): 
fprintf(fid, 'Dynamic/Static  Mem  Cycle  Time:  %s\n\n',... 

num2str(MemRatio)): 

fprintf(fid,'ClkBank#  WR  ReqAllowed  MemResp  Bank#\n'); 
elseif  Level==2; 

fprintf(fid,'%s\n',num2str(NoBanks)); 

fprintf(fid,'%s\n',num2str(NoCE-1)); 

fprintf(fid,'%s\n',num2str(MemRatio)); 

end; 
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%  File  Name: 

%  Description: 
%  appropriate. 
%  Programmer: 
%  Date  Mod: 

% 


mem_resp.m 

Evaluate  and  process  a  memory  response  if 

Raymond  F.  Bernstein  Jr. 

07  Mar  95 


%  Comments:  3/7:  Empty  modified  to  be  a  SF  variable  rather 
%  than  a  flag. 

% 


%  function  [Index, Address, RW, Ready, Data, OC, Empty, REI,ResC]=  ... 

%  mem_resp(lndex,Address,RW,Ready,Data,NAC,CPC, ... 

%  OC,Empty,REI,RE,ResC,BankNo,NoCE) 

% 

%  See  definitions  in  Chapter  V,  Section  B,  Subsection  1)  for  definitions 

% 


function  [OCN,ResCN,MemResp,OutData]=  ... 

mem_resp(lndex,RW,Ready,Data,NAC,CPC, ... 

OC,REI,ResC,MemResp,BankNo,NoCE,... 

OCN.ResCN) 


OutData  =  -1 ; 

RE  =  max(REI); 
if  (REI(BankNo)==1), 

OutData=Data(OC(BankNo),BankNo): 
ResCN(BankNo)  =  ResC(BankNo)  +  1; 
OCN(BankNo)  =  modulo1(OC(BankNo)+1,NoCE); 
MemResp  =  [1  BankNo]; 
elseif  (RE==1) 

ResCN(BankNo)  =  ResC(BankNo)  +  1; 

end; 
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%  File  Name:  modulol.m 

%  Description:  Performs  the  remainder  operation  on  two  numbers 

%  but  the  result  is  mapped  to  1  ..K  for  modulol  (x,k) 

%  Programmer:  Raymond  F.  Berntsein  Jr. 

%  Date  Mod:  6  Feb  95 

% 

%  function  result  =  modulol  (x,k) 

%  where 

%  k  Is  the  modulus  number. 

%  X  Is  the  number  to  be  acted  upon. 

% 

function  result  =  modulol  (x,k) 

result  =  rem(x,k); 
if  result==0, 

result  =  k; 
elseif  result<0, 
done  =  0; 
while  -done, 

result  =  result  +  k; 
done  =  result>0: 
end;  %while 

end; 
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%  File  Name: 

%  Description: 

% 

% 

%  Programmer: 
%  Date  Mod: 

%  Notes: 

% 


pbjnt.m 

generates  a  bank  selection  sequence  based  on  a  PB 
matrix  A 

Raymond  F.  Berntsein  Jr. 

Oct  95 


%  function  [ResuItVect.NoDigits]  =  pbJnt(Addr,  A,  fname) 
% 

% 


% 

% 

% 

% 

% 

% 


where: 

ResuItVect 

NoDigits 

Addr 


% 


% 


A 

fname 


Output  vector  of  permuted  bank  numbers 
Number  of  digits  in  address  pattern.  The  return  value 
Input  Address  stream 

will  be  the  number  of  bits  required  to  represent  the 
largest  number  in  Addr. 

Permutation  matrix 

Name  of  the  file  to  store  the  resulting  bank  selection 
patterns. 


function  [ResuItVect, NoDigits]  =  pbJnt(Addr,  A,  fname) 


if  ((nargin~=3)), 

fprintf(1 , 'Invalid  parameters  for  PB  conversion  type\n'): 
exit(-1): 

end; 

%  Assuming  working  with  base  2 
B  =  2; 

%  Make  it  a  column  matrix 
s  =  size(Addr); 
if  s(2)>s(1), 

Addr  =  Addr'; 

end; 

%  Make  it  addresses  only  (i.e.,  no  read/write  into) 
ifs(2)~=1, 

Addr  =  Addr(:,1); 

end; 

maxAddr  =  max(Addr); 

Count  =  ceil(log10{maxAddr)/log10(2)); 

NoDigits  =  Count; 

M  =  length  (Addr); 
done  =  0; 
i=  1; 
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while  -done, 

bm(:,i)  =  rem(Addr,2); 

Addr  =  fix(Addr/2); 
if  i==Count, 

done  =  1; 

else, 

i=i+1; 

end; 

end; 

bm  =  fliplr(bm); 

%  Binary  version  of  addr  (bm)  is  complete.  Now  use  only  the  k  LSBs  to 
%  compute  the  bank  (i.e.,  k  is  the  number  of  columns  in  A 
As  =  size(A); 
bms  =  size(bm); 

NoColA  =  As(2); 

NoColbm  =  bms(2); 
if  (bms(2)>As(2)), 

bm  =  bm(:,NoColbm-NoColA+1:NoColbm); 
elseif  (bms(2)<As(2)), 

A  =  A(:,NoColA-NoColbm+1  :NoColA); 

end; 

ResultVect  =  A*bm'; 

ResultVect  =  rem(ResultVect',2); 

%  Create  a  Powers  matrix 
PowerVect  =  ones(size(ResultVect)); 
s  =  size(ResultVect); 
for  i=0:s(2)-1, 

PowerVect(:,s(2)-i)  =  PowerVect(;,s(2)-i)*(B^i); 

end; 

ResultVect  =  ResultVect.*PowerVect; 

ResultVect  =  sum(ResultVect'); 

if  fname~=0, 

fid  =  fopen([fname,'.bks'],'wt‘); 
f  printf  (f  id ,  '%d  \n',  ResultVect') ; 
fclose(fid); 

end; 
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%  File  Name:  sav_res.m 

%  Description:  Save  the  results  of  the  pass  for  analysis. 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  14Jun95 

% 

%  Comments:  3/1 3:  Provide  for  2  cycles  per  page  for  full  dump. 

%  6/14:  Add  address  issued  to  calculate  all  latency 

%  Level  2  only 

% 

%  sav_res(lndex,Address,RW, Ready, Data, NAG, CPC,  ... 

%  OC,DCount,Empty,PDC,GRI,REI,BS,ReqC,ResC); 

%  where 

% 

%  See  definitions  in  Chapter  V,  Section  B,  Subsection  1)  for  definitions 

% 

function  sav_res(lndex,Address,RW,Ready,Data,NAC,CPC,  ... 

OC,DCount,Empty,PDC,GRI,REI,BS,ReqC,ResC,SysClk,... 

NoBanks,BankSelNo,RWFIag,NoCE,fid,Level,MemResp,... 

ReqAllowed,DRAMResp,DRAMIssued,MemRatio) 


\n',  SysClk); 


if  Level  ==  0;  %  Full  Dump 

fprintf(fid, ******************  Clk=%3d  ******************\n'^  SysClk); 
for  k=1:NoBanks, 

fprintf(fid,'Bank#:  %3d\n',k); 
fprintf(fid,'***Cache  Element  Contents***\n'); 
fprintf(fid,’No  Index  AddrRW  Rdy  Data\n'); 
for  m=1  :NoCE, 

%%  fprintf(1  ,'m=  %d\n’,m); 

fprintf(fid,'%4d  %5d  %5d  %2d  %3d  %8d', ... 

m,  lndex(m,k),  Address(m,k),  RW(m,k),  Ready(m,k), .. 
Data(m,k)); 

if  (ReqAllowed(1)==1) ... 

&  (ReqAllowed(2)==k) ... 

&  (modulo1(NAC(k)-1,NoCE)==m), 
fprintf(fid,'  <--CPU\n'); 
elseif  (MemResp(1)==1) ... 

&  (MemResp(2)==k) ... 

&  (modulo1(OC(k)-1,NoCE)==m), 
fprintf(fid,’  ->CPU\n'); 
elseif  (DRAMResp(1,k)==1)  ... 

&  (DRAMResp(2,k)==k) ... 

&  (modulo1(CPC(k)-1,NoCE)==m), 
fprintf(fid,'  <--DRAM\n'); 
elseif  (DRAMIssued(1,k)==1) ... 

&  (DRAMIssued(2,k)==k) ... 

&  (modulo1(CPC(k),NoCE)==m), 
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fprintf(fid,'  -->DRAM\n’); 
else  fprintf(fid,'\n'); 
end;  %  if ..  elseif 
end;  %  for  m=1 
fprintf(fid,'NAC=  %d',NAC(k)); 
fprintf(fid,’OC  =  %d',OC(k)); 
fprintf(fid,'CPC=  %d',CPC(k)); 
fprintf(fid,'DCount=  %d\n',DCount(k)); 
fprintf(fid,'Empty=  %d',Empty(k)); 
fprintf(fid,'PDC=  %d',PDC(k)); 
fprintf(fid,'GRI=  %d',GRI(k)); 
fprintf(fid;REI=  %d\n',REl(k)); 
fprintf(fid,'ReqC=  %d',ReqC(k)); 
fprintf(fid,'ResC=  %d\n\n',ResC(k)); 
end;  %for  k=1 

fprintf(fid,‘ClkBank#  WR  MemResp  Bank#  ReqAllowed\n'); 

fprintf(fid,‘7o4d  ’,SysClk); 

fprintf(fid,'%5d  ’.BankSelNo); 

fprintf(fid;%2d  ’.RWFlag); 

fprintf(fid,'%7d  MemResp(1)); 

fprintf(fid,'%5d MemResp(2)); 

fprintf(fid,'7o10d  ReqAllowed(l)); 

if  ~(rem(SysClk,2)),  fprintf(fid,'\n\f); 

else,  fprintf(fid,'\n\n\n\n'); 

end; 

%end;  if  Level==0 


O/  O/  O/  O/  O/  O/  O/  O/  Of  0/  O/  O/  O/  O/  O/  0/  O/  O/  0/  0/  0/  O/  O/  O/  O/  O/  O/  0/  0/  o/  o/  o/  o/  o/ 
/o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o  /o 


elseif  Level  ==  1 ;  %  Validate  paper  studies 
fprintf(fid,'7o4d  ’.SysClk); 
fprintf(fid,'7o5d  ’.BankSelNo); 
fprintf(fid,'%2d  RWFlag); 
fprintf(fid,'%10d  ReqAllowed(l)); 
fprintf(fid,'7o7d MemResp(1)); 
fprintf(fid,'7o5d MemResp(2)); 
fprintf(fid,'\n'); 

7oend;  %  elseif  Level==1 
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%  Data  Analysis 


elseif  Level  ==  2; 

fprintf(fid,'%4d  '.SysClk); 
fprintf(fid,'%5d  '.BankSeINo); 
fprintf(fid,'%2d  ',RWFIag); 
fprintf(fid,'%10d  %5d  ',ReqAllowed(1),ReqAllowed(3)); 
if  ReqAllowed(l); 

fprintf(fid,'%4d  ... 

lndex(modulo1  (NAC(ReqAllowed(2))-1  ,NoCE),ReqAllowed(2))) 
else  fprintf(fid,'  -1'); 
end; 

fprintf(fid,‘%7d  MemResp(1)); 
if  MemResp(1); 

fprintf(fid,'%4d  ... 

index(modulo1  (OC(MemResp(2))-1  ,NoCE),MemResp(2))); 
else  fprintf(fid,'  -1'); 
end; 

fprintf(fid,'%5d  MemResp(2)); 
fprintf(fid,'\n'); 
end;  %  elseif  Level==2 


%  File  Name:  ser_dmem.m 

%  Description:  Service  dyamic  memory  within  a  bank 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  08  Mar  95 

%  Comments:  3/8:  modified  to  require  ReqCount  cycles  to  complete  a 

%  DRAM  cycle  rather  than  ReqCount+1  cycles. 

%  function  [ReadyN,DataN,CPCN,DCountN,PDCN,DRAMResp,DRAMIssued] 
%  ser_dmem(Address,RW, Ready, Data, NAC, CPC, OC, ... 

%  DCount,PDC,BankNo,ReqCount,NoCE, ... 

%  ReadyN,DataN,CPCN,DCountN,PDCN, ... 

%  DRAMResp,DRAMIssued); 

% 

%  See  definitions  in  Chapter  V,  Section  B,  Subsection  1)  for  definitions 

% 

function  [ReadyN,DataN,CPCN,DCountN,PDCN, ... 
DRAMResp,DRAMIssued]=  ... 

ser_dmem (Address , RW , Ready, Data, N AC ,C PC ,OC , . . . 

DCount,PDC,BankNo,ReqCount,NoCE,  ... 
ReadyN,DataN,CPCN,DCountN,PDCN,  ... 
DRAMResp,DRAMIssued) 

SDRC  =  ~PDC(BankNo)  &  (CPC(BankNo)  ~=  NAC(BankNo))  &  ... 
(RW(CPC(BankNo))==1); 

SDWC  =  ~PDC(BankNo)  &  (CPC(BankNo)  ~=  NAC(BankNo))  &  ... 

(RW(CPC(BankNo))==0); 
if  (SDRC==1), 

DCountN(BankNo)  =  1; 

PDCN(BankNo)  =  1; 

DRAMIssued(:,BankNo)  =  [1;  BankNo;  1]; 
elseif  (SDWC==1),  : 

DCountN(BankNo)  =  1 ; 

PDCN(BankNo)  =  1 ; 

DRAMIssued(:, BankNo)  =  [1;  BankNo;  0]; 
elseif  PDC(BankNo)==1 , 

DCountN  (BankNo)  =  DCount(BankNo)  +  1; 
if  DCountN(BankNo)==ReqCount, 

DataN(CPC(BankNo),BankNo)  =  77777; 

ReadyN(CPC(BankNo), BankNo)  =  1; 

CPCN(BankNo)  =  modulol  (CPC(BankNo)+1  ,NoCE); 
PDCN(BankNo)  =  0; 

DRAMResp(:,BankNo)  =  [1;  BankNo]; 
end;  %  if 
end;  %  elseif 
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%  File  Name: 

%  Description: 
%  Programmer: 
%  Date  Mod: 

% 


ser_memr.m 

Service  memory  requests  from  the  processor. 
Raymond  F.  Bernstein  Jr. 

14  Jun  95  Added  address  to  ReqAllowed 


%  function  []  ser_memr(lndex,Address,RW, Ready, Data, NAC, CPC,  ... 

%  OC,DCount,Empty,PDC,GRI,REI,BS,ReqC,ResC,NoCE); 

% 

%  See  definitions  in  Chapter  V,  Section  B,  Subsection  1)  for  definitions 

% 

function  [lndexN,AddressN,RWN,ReadyN,... 

DataN,NACN,ReqCN, ReqAllowed]  =  ... 
ser_memr{lndex,Address,RW,Ready,Data,NAC,CPC,OC,GRI,... 
BS,ReqC,ReqAllowed,BankNo,Addr,BankSelNo,RWFIag,  ... 
NoCE,lndexN,AddressN,RWN,ReadyN,DataN,NACN,ReqCN) 
if  BankSelNo>=0, 

if  (GRI(BankSelNo)==1)  &  (BankSelNo==BankNo), 

lndexN(NAC(BankNo),BankNo)  =  ReqC(BankNo); 
AddressN(NAC(BankNo),BankNo)  =  Addr; 
RWN(NAC(BankNo),BankNo)  =  RWFlag; 
DataN(NAC(BankNo),BankNo)  =  Addr; 
ReadyN(NAC(BankNo),BankNo)  =  0; 

ReqCN(BankNo)  =  ReqC(BankNo)  +  1 ; 

NACN(BankNo)  =  modulo1(NAC(BankNo)+1,NoCE); 
ReqAllowed  =  [1  BankNo  Addr]; 
elseif  GRI(BankSelNo)==1 , 

ReqCN(BankNo)  =  ReqC(BankNo)  +  1 ; 

end; 

end; 
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%  File  Name:  ser_memr.m 

%  Description:  Service  memory  requests  from  the  processor. 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  14Jun95  Added  address  to  ReqAllowed 

% 

%  function  []  ser_memr(lndex, Address, RW, Ready, Data, NAC, CPC, ... 

%  OC,DCount,Empty,PDC,GRI,REI,BS,ReqC,ResC,NoCE); 

% 

%  See  definitions  in  Chapter  V,  Section  B,  Subsection  1)  for  definitions 
% 

function  [lndexN,AddressN,RWN,ReadyN,DataN,... 

NACN,ReqCN, ReqAllowed]  = ... 

ser_memr(lndex,Address,RW,Ready,Data,NAC,CPC,OC,GRI,... 

BS,ReqC,ReqAllowed,BankNo,Addr,BankSelNo,RWFIag,NoCE, ... 
lndexN,AddressN,RWN,ReadyN,DataN,NACN,ReqCN) 
if  BankSelNo>=0, 

if  (GRI(BankSelNo)==1)  &  (BankSelNo==BankNo), 

lndexN(NAC(BankNo),BankNo)  =  ReqC(BankNo); 
AddressN(NAC(BankNo),BankNo)  =  Addr; 
RWN(NAC(BankNo),BankNo)  =  RWFlag; 
DataN(NAC(BankNo),BankNo)  =  Addr; 
ReadyN(NAC(BankNo),BankNo)  =  0; 

ReqCN(BankNo)  =  ReqC(BankNo)  +  1 ; 

NACN(BankNo)  =  modulo!  (NAC(BankNo)+1  ,NoCE); 

ReqAllowed  =  [1  BankNo  Addr]; 
elseif  GRI(BankSelNo)==1, 

ReqCN(BankNo)  =  ReqC(BankNo)  +  1 ; 

end; 

end; 
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%  File  Name:  sim_comp.m 

%  Description:  Evaluate  if  the  simulation  is  completed 

%  Programmer:  Raymond  F.  Bernstein  Jr. 

%  Date  Mod:  03  Mar  95 

% 

%  function  done  =  sim_comp(LastAddr, Empty) 

%  where 

%  done  1  -  Indicates  the  the  simulation  is  complete 

%  0  -  Indicates  it  is  not  complete 

%  LastAddr  1  -  Indicates  more  than  one  more  memory  references  are 

%  to  come 

%  0  -  Indicates  the  last  memory  reference  is  being  requested 

%  1  -  Indicates  no  more  memory  references  will  be  requested 

%  Empty  Cache  Element  array  indicating  whether  a  memory  bank 

%  is  empty  (i.e.,  no  memory  requests  are  pending  to  be 

%  processed. 


% 


function  done  =  sim_comp(LastAddr, Empty) 


AIIEmpty  =  min(Empty); 
done  =  AIIEmpty  &  (LastAddr<0); 
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%  File  Name: 

%  Description: 
% 

%  Programmer: 
%  Date  Mod: 

% 


m_anal.m 

Organize  data  in  graphical  form  for  analysis  of  memory 
data  from  stm. 

Raymond  F.  Bernstein  Jr. 

29  Oct  95 


%  function  [TP,S,MaxL,AvgL,StdUSSTP,TR]  = 

%  m_anal(fname, ASCII, Apattern.WinLen.PlotFlag, Length, PrintFlag) 
%  where 


% 

fname 

Name  of  the  file  containing  data  produced  by  stm 

% 

ASCII 

Indicates  whether  fname  is  stored  as  ASCII  or  binary 

% 

0  -  Binary 

% 

1  -  ASCII 

% 

Apattern 

Short  description  of  the  Address  pattern 

% 

WinLen 

Length  of  the  smoothing  window  for  throughput 

% 

PlotFlag 

Specifies  the  number  and  types  of  plots 

% 

0  No  plot 

% 

1  One  plot 

% 

Length 

Specifies  #  pts  used  in  a  plot 

% 

PrintFlag 

0  -  Print  to  Screen 

% 

1  -  Print  to  postscript  file 

% 

2  -  Print  directly  to  default  printer 

function  [TP,S,MaxL,AvgL,StdL,SSTP,TR]  =  ... 

m_anal(fname,ASCIl,Apattern,WinLen,PlotFlag, Length, PrintFlag) 


if  (ASCII<0)l(ASCII>1)l(WinLen<0)l(PlotFlag<0)l... 
(PlotFlag>1)l(Length<0)l(PrintFlag>2)l(PrintFlag<0), 
fprintf(1  ,'m_anal::lncorrect  parameters!!!\n'); 
exit; 

end; 


%  Read  data  in  from  the  file 
if  ASCII, 

fnamel  =  [fname,  '.gr']; 
fid  =  fopen(fname1  ,'rt'); 

NoBanks  =  fscanf(fid,'%d',1); 

NoCE  =fscanf(fid,'%d’,1); 

MemRatio  =  fscanf(fid,'%d',1); 

[Data, COUNT]  =  fscanf(fid,'%d',inf); 
fclose{fid); 

NoRows  =  COUNT/9; 
for  i=1:NoRows, 

DAry(i,1 :9)=  Data((i-1  )*9+1  :(i-1  )*9+9)'; 

end; 
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else 


fnamel  =  [fname,  '.gr1']; 
fid  =  fopen(fname1  .V); 

Tmp  =  fread(fid,3,'long'); 

NoBanks  =  Tmp(1); 

NoCE  =  Tmp(2); 

MemRatio  =  Tmp(3): 
fclose(fid); 

fnamel  =  [fname,  '.gr2']: 
fid  =  fopen(fname1  ,V); 

[Data,  COUNT]  =  fread(fid,inf,'long'); 
fclose(fid); 

NoRows  =  COUNT/9; 
for  i=1:9, 

DAry(1:NoRows,i)  = ... 

Data((i-1  )*NoRows+1  :(i-1  )*NoRows+NoRows); 

end; 

end; 

%  Calculate  the  Latency 
%  Handle  the  first  one  seperate 
CAddr=  DAry(1,6); 
k  =  1; 

while  DAry(k,8)~=CAddr, 
k  =  k+1 ; 
end;  %while 
OAry(1,1)  =  k; 

LastLatency  =  OAry(1 ,1); 

%  Now  do  the  remaining  rows 
for  i=2:NoRows, 

if  DAry(i,5)==-1 , 

OAry(i,1)  =  LastLatency; 
elseif  DAry(i,5)  ==  DAry(i-1,5), 

OAry(i,1)  =  LastLatency; 

else 

k=i; 

while  (DAry(k,6)==-1), 
k  =  k  +  1; 

end; 

CIndex  =  DAry(k,6); 
while  DAry(k,8)~=Clndex, 
k  =  k+1 ; 
end;  %while 
OAry(i,1)  =  k-i+1; 

LastLatency  =  OAry(i,1); 

end; 
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OAty(i,2)=DAry(i,4);  %  Request  Allowed 
OAty(i,3)=DAry(i,7);  %  Memory  Response 
end;  %for 

MaxL=max(OAry(:,  1 )); 

AvgL=mean(OAry(:,1)); 

StdL=std(OAry(;,1)): 

%%%  Throughput  %%% 

SmoothWin  =  ones(WinLen,1); 

Throughput  =  conv(SmoothWin,OAry(:,3))A/\/inLen; 
OAry(;,4)  =  ThroughPut(1 :  length  (OAry)); 

TotalThroughPut  =  sum(OAry(:,3))/length(OAry); 

TP  =  TotalThroughPut; 

%%%  Calculate  length  of  effective  response  %%% 
%%%%%%%%%%%%%%%%%%%%%%%%% 
if  Length==0  I  Length>length(OAry), 

Length  =  length(OAry); 

XAxisLbl  =  1 :  Length; 

else 

XAxisLbl  =  1  :Length;  %  Use  user  specified  length 

end; 

%%%  Steady  State  Throughput  %%% 

%%%  Transient  Time  %%%%%%%% 

SSTP  =  OAry(1,4); 

TR  =  1; 
for  i=2:Length, 

if  OAry(i,4)~=SSTP, 

SSTP  =  OAry(i,4); 

TR  =  i; 

end; 

end; 

TR  =  TR- WinLen+1; 

%%%  Check  validity  of  SSTP  %%% 
if  (TR>=0.5*Length) 

SSTP  =  mean(OAry(0.50*Length:0.75*Length,4)); 

end; 

if  Length==0  I  Length>length(OAry), 

Length  =  length(OAry); 

XAxisLbl  =  1  :Length; 

else 

XAxisLbl  =  1  iLength;  %  Use  user  specified  length 

end; 
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%%%  Speed  Up  %%% 

S  =  TotalThroughPut*MemRatio: 


%%%  Clean  Up  %%% 

0Aty(:,2)  =  OAry(:,2)*0.25;  %  Request  Allowed  (GR) 
OAry(:,3)  =  OAry(:,3)*0.50;  %  Memory  Response  (RE) 

%%%  Graphics  Plot  %%% 
if  Length==0  I  Length>length(OAry), 

Length  =  length(OAry); 

XAxisLbl  =  1  iLength; 

else 

XAxisLbl  =  liLength;  %  Use  user  specified  length 

end; 


if  (PlotFlag==0), 

%  Do  nothing 
else  %  Plot  one  figure 
if  (PrintFlag==0), 
figure; 

end; 

subplot(3,1,1); 

plot(XAxisLbl,OAry(1:Length,1));grid; 

ylabel('Latency'); 

title(['Plot  ID:  '.Apattern, ... 

'#  Banks=',num2str(NoBanks), ... 

■  #  CEs=',num2str(NoCE), ... 

'  Mem  Ratio=’,num2str(MemRatio)]); 
axis([0  Length  0  MaxL*1.2]); 
subplot(3,1,2); 

plot(XAxisLbl,OAry(1:Length,4));grid; 
ylabel(Throughput'); 
title(['S=',num2str(S), ... 

'Avg  TP=',num2str(TotalThroughPut), ... 
'MaxL=‘,num2str(MaxL), ... 
'AvgL=‘,num2str(AvgL), ... 
'StdL=',num2str(StdL)]); 
axis([0  Length  0  1 .2]); 
subplot(3,1,3); 

plot(XAxisLbl,OAry(1:Length,2:3));grid; 
xlabel(Time  (Cycles)'); 
ylabel('STM  Status'); 
title(['SSTP=',num2str(SSTP), ... 
'TR=',num2str(TR)]); 
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axis([0  Length  0  0.6]); 
ah  =  gca; 

set(ah,'YTick’,[0;  0.25;  0.5]) 
set(ah,'YTickLabels',['  ’;'GR':'RE']); 

If  (PrintFlag==1), 

set(gcf,'PaperPosition',[0.25  2.5  5.8  8.2]); 
eval(['print  ',fname,'  -deps2']); 
title(['S=',num2str(S), ... 

‘Avg  TP=’,num2str(TotalThroughPut), ... 
'MaxL=',num2str(MaxL), ... 
'AvgL=',num2str(AvgL),  ... 
■StdL=',num2str(StdL)]); 

end; 

if  (PrintFlag==2), 
orient  tall 
print; 

end; 

end; 
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